Systems Infrastructure for Data Science
Web Science Group, Uni Freiburg, WS 2012/13

Data Stream Processing

Topics
• Model Issues
• System Issues
• Distributed Processing
• Web-Scale Streaming

Data Streams
• Continuous sequences of data elements that are typically:
  – Push-based (data flow controlled by sources)
  – Ordered (e.g., by arrival time, or by explicit timestamps)
  – Rapid (e.g., ~100K messages/second in market data)
  – Potentially unbounded (may have no end)
  – Time-sensitive (usually representing real-time events)
  – Time-varying (in content and speed)
  – Unpredictable (autonomous data sources)

Example Applications
• Financial Services
  Example: Trades(time, symbol, price, volume)
  Typical Applications:
  – Algorithmic Trading
  – Foreign Exchange
  – Fraud Detection
  – Compliance Checking

Financial Services: Skyrocketing Data Rates
[Figure: OPRA Message Traffic Projections. Y-axis: messages per second (mps), 0 to 1,000,000; x-axis: date. Plotted values range from 75,000 to 907,000 mps. Source: Options Price Reporting Authority, http://www.opradata.com]
Some more up-to-date rates from http://www.marketdatapeaks.com/:
• 4 M mps on January 25, 2013
• 6.65 M mps on October 7, 2011
Low response time is critical (think high-frequency trading)!

Example Applications
• System and Network Monitoring
  Example: Connections(time, srcIP, destIP, destPort, status)
  Typical Applications:
  – Server load monitoring
  – Network traffic monitoring
  – Detecting security attacks
    – Denial of Service
    – Intrusion

Network Monitoring: Bursty Data Rates
[Figure: bursty network traffic rates. Source: Internet Traffic Archive, http://ita.ee.lbl.gov/]

Example Applications
• Sensor-based Monitoring
  Example: CarPositions(time, id, speed, position)
  Typical Applications:
  – Monitoring congested roads
  – Route planning
  – Rule violations
  – Tolling

Historical Background
• 1990s: Various extensions to traditional database systems
  – Triggers in Active DBs, Sequence DBs, Continuous Queries, Pub/Sub, etc.
• Early 2000s: Data Stream Management Systems
  – Aurora [Brandeis-Brown-MIT]
  – STREAM [Stanford]
  – TelegraphCQ [UC Berkeley]
  – Many others (NiagaraCQ, Gigascope, Nile, PIPES, …)
• 2003: Start-ups
  – Aurora -> StreamBase, Inc. -> Borealis (= distributed Aurora)
  – STREAM -> Coral8, Inc.
• 2005: More Start-ups
  – TelegraphCQ -> Truviso, Inc.
• Today: Growing industry interest and standardization efforts

A Paradigm Shift in Data Processing Model
[Figure: two pipelines side by side.
  Traditional Data Management: Query -> DBMS (holding a Data Base) -> Answer
  Data Stream Management: Data -> DSMS (holding a Query Base) -> Answer]

DBMS vs. DSMS
• DBMS
  – Persistent relations
  – Read-intensive
  – One-time queries
  – Random access
  – Access plan determined by query processor and physical DB design
• DSMS
  – Transient streams
  – Update-intensive
  – Continuous queries (a.k.a. long-running, standing, or persistent queries)
  – Sequential access
  – Unpredictable data characteristics and arrival patterns

Model Issues
• Data models
  – Relational-based vs. XML-based vs. Object-based
  – Time and Order
• Query models
  – Declarative vs. Procedural
  – Window-based Processing

Example Models
• STREAM / CQL [Stanford]
  – Relational-based data model
  – Declarative query language (SQL extensions)
• Aurora / SQuAl [Brandeis-Brown-MIT]
  – Relational-based data model
  – Procedural query language (Relational algebra extensions)
• MXQuery [ETH Zurich]
  – XML-based data model
  – Declarative query language (XQuery extensions)

Window-based Processing
• Windows are finite excerpts of a potentially unbounded stream.
• Most streaming applications are interested in the readings of the recent past.
• Windows help us unblock operators such as aggregates.
• Windows help us bound the memory usage for operators such as joins.

Window Example
• Two basic parameters: size and slide
• Example: Trades(time, symbol, price, volume), with size = 10 min and slide by 5 min
  (10:00, “IBM”, 20, 100)
  (10:00, “INTC”, 15, 200)
  (10:00, “MSFT”, 22, 100)
  (10:05, “IBM”, 18, 300)
  (10:05, “MSFT”, 21, 100)
  (10:10, “IBM”, 18, 200)
  (10:10, “MSFT”, 20, 100)
  (10:15, “IBM”, 20, 100)
  (10:15, “INTC”, 20, 200)
  (10:15, “MSFT”, 20, 200)
  ...
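To make the size and slide parameters concrete, the windows they induce over the example tuples can be enumerated. The following Python sketch assumes half-open windows [start, start + size) aligned to the first timestamp. The slide does not say how window boundaries are treated, so this is only one plausible choice. The helper names `to_minutes` and `sliding_windows` are hypothetical, and the code works on a finite list rather than a live stream.

```python
# Trades(time, symbol, price, volume) tuples from the example above.
TRADES = [
    ("10:00", "IBM", 20, 100), ("10:00", "INTC", 15, 200), ("10:00", "MSFT", 22, 100),
    ("10:05", "IBM", 18, 300), ("10:05", "MSFT", 21, 100),
    ("10:10", "IBM", 18, 200), ("10:10", "MSFT", 20, 100),
    ("10:15", "IBM", 20, 100), ("10:15", "INTC", 20, 200), ("10:15", "MSFT", 20, 200),
]


def to_minutes(hhmm):
    """Convert an "HH:MM" string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def sliding_windows(trades, size, slide):
    """Yield (start, contents) per window; size and slide are given in minutes.

    Assumption: windows are half-open, [start, start + size), and the first
    window starts at the timestamp of the first tuple.
    """
    if not trades:
        return
    start = to_minutes(trades[0][0])
    last = to_minutes(trades[-1][0])
    while start <= last:
        contents = [t for t in trades if start <= to_minutes(t[0]) < start + size]
        yield start, contents
        start += slide


for start, contents in sliding_windows(TRADES, size=10, slide=5):
    print(f"{start // 60}:{start % 60:02d}", len(contents), "tuples")
```

Because the slide (5 min) is smaller than the size (10 min), consecutive windows overlap and each tuple appears in up to two windows.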

Windows: Unblocking Aggregate Operation
• Problem: Average over the whole stream
  – Input: ... 30 15 30 20 10 30
  – No results can be produced until the stream ends.
  – Average is “blocked”.
• Solution: Average over windows with size = 3, slide = 3
  – Input: ... 30 15 30 20 10 30
  – Output: ... 25 20
  – Average can be computed on sliding windows.
  – Average is “unblocked”.
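The unblocking effect can be sketched in a few lines of Python. This is a minimal illustration, assuming tuple-based windows and that values arrive in the order listed; the function name `windowed_average` is hypothetical. The generator emits a result as soon as a window is complete instead of waiting for the end of the stream.

```python
from collections import deque


def windowed_average(stream, size, slide):
    """Yield the average of the last `size` values, once every `slide` arrivals."""
    window = deque(maxlen=size)  # keeps at most `size` values, so memory stays bounded
    for count, value in enumerate(stream, start=1):
        window.append(value)
        # The first window is complete after `size` arrivals;
        # each later window closes `slide` arrivals after the previous one.
        if count >= size and (count - size) % slide == 0:
            yield sum(window) / size


print(list(windowed_average([30, 15, 30, 20, 10, 30], size=3, slide=3)))
```

With size = 3 and slide = 3 the two windows are (30, 15, 30) and (20, 10, 30), giving the averages 25 and 20 shown on the slide.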

Windows: Bounding Join State
• Problem: Join over whole streams
  – Inputs: ... 20 10 30 and ... 10 15 30
  – Output: ... (10, 10) (30, 30)
  – Join must buffer its inputs until both streams end.
  – Join state is “unbounded”.
• Solution: Join over windows with size = 2
  – Inputs: ... 20 10 30 and ... 10 15 30
  – Output: ... (10, 10) (30, 30)
  – Join must only buffer the latest window on its inputs.
  – Join state is “bounded”.
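A windowed join can be sketched as a symmetric equi-join in which each input keeps only its most recent values. The Python sketch below is an illustration under stated assumptions rather than the algorithm of any particular system: windows are tuple-based, the two inputs arrive interleaved one element at a time, and the names `window_join` and `events` are hypothetical.

```python
from collections import deque


def window_join(events, size):
    """Symmetric equi-join over (side, value) events.

    Each side buffers only its `size` most recent values, so the join state
    never exceeds 2 * size values regardless of stream length.
    """
    windows = {"left": deque(maxlen=size), "right": deque(maxlen=size)}
    for side, value in events:
        other = "right" if side == "left" else "left"
        # Probe the opposite window first, then store the new value.
        for candidate in windows[other]:
            if candidate == value:
                yield (value, candidate) if side == "left" else (candidate, value)
        windows[side].append(value)


left = [20, 10, 30]
right = [10, 15, 30]
# Assumption: the two inputs arrive in lockstep, left element first.
events = [e for l, r in zip(left, right) for e in (("left", l), ("right", r))]
print(list(window_join(events, size=2)))
```

On the example inputs this yields the pairs (10, 10) and (30, 30), while never holding more than two values per input.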

STREAM CQL: Continuous Query Language
• SQL for Relation-to-Relation operations
• Additionally:
  – “Stream” as a new data type (in addition to “Relation”)
  – Continuous instead of one-time query semantics
  – Stream-to-Relation operations:
    • Window specifications derived from SQL-99
  – Relation-to-Stream operations:
    • Three special operators: Istream, Dstream, Rstream
  – Simple sampling operations on streams

CQL: Streams vs. Relations
• T: discrete, ordered time domain
• A stream S is a possibly infinite bag of elements <s, t>, where s is a tuple with the schema of S and t ∈ T is the timestamp of the element.
  – Note: Timestamp is not part of the tuple schema!
• A relation R is a mapping from each time instant in T to a finite but unbounded bag of tuples with the schema of R.

CQL: Continuous Query Semantics
• Time “advances” from t-1 to t when all inputs up to t-1 have been processed.
• For a query producing a stream:
  – At time t ∈ T, all inputs up to t are processed and the continuous query emits any new stream result elements with timestamp t.
• For a query producing a relation:
  – At time t ∈ T, all inputs up to t are processed and the continuous query updates the output relation to state R(t).

CQL: Mappings between Streams and Relations
[Figure: three operator classes connecting Streams and Relations.
  Stream-to-Relation: Streams -> Relations
  Relation-to-Relation: Relations -> Relations
  Relation-to-Stream: Relations -> Streams]
• Stream-to-Stream = Stream-to-Relation + Relation-to-Stream

CQL: Stream-to-Relation Operators
• Time-based sliding windows
  – FROM S[RANGE T]
• Tuple-based sliding windows
  – FROM S[ROWS N]
• Partitioned windows
  – FROM S[PARTITION BY A1, …, Ak RANGE T]
  – FROM S[PARTITION BY A1, …, Ak ROWS N]
• Windows with a “slide” parameter
  – FROM S[RANGE T SLIDE L]
  – FROM S[ROWS N SLIDE L]
  – FROM S[PARTITION BY A1, …, Ak RANGE T SLIDE L]
  – FROM S[PARTITION BY A1, …, Ak ROWS N SLIDE L]
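The window specifications above each turn a stream into a relation, i.e. a bag of tuples for any given time t. The following Python sketch illustrates three of them under stated assumptions: the stream is a list of (tuple, timestamp) pairs in timestamp order, RANGE T is taken to cover the closed interval [t - T, t], and N is at least 1. The function names are hypothetical and the SLIDE variants are left out.

```python
# A small stream of (tuple, timestamp) elements; tuples are (symbol, price)
# and timestamps are minutes. The values are illustrative.
STREAM = [
    (("IBM", 20), 0), (("INTC", 15), 0), (("MSFT", 22), 0),
    (("IBM", 18), 5), (("MSFT", 21), 5),
    (("IBM", 18), 10), (("MSFT", 20), 10),
]


def range_window(stream, t, T):
    """S[RANGE T] at time t: tuples with timestamp in [t - T, t] (assumed closed)."""
    return [s for s, ts in stream if t - T <= ts <= t]


def rows_window(stream, t, N):
    """S[ROWS N] at time t: the N most recent tuples with timestamp <= t (N >= 1)."""
    seen = [s for s, ts in stream if ts <= t]
    return seen[-N:]


def partitioned_rows_window(stream, t, key, N):
    """S[PARTITION BY key ROWS N] at time t: the N most recent tuples per key value."""
    partitions = {}
    for s, ts in stream:
        if ts <= t:
            partitions.setdefault(key(s), []).append(s)
    return [s for part in partitions.values() for s in part[-N:]]


print(range_window(STREAM, t=10, T=5))
print(rows_window(STREAM, t=5, N=2))
print(partitioned_rows_window(STREAM, t=10, key=lambda s: s[0], N=1))
```

Each function returns the state of the output relation at one time instant, which matches the view of a relation as a mapping from time instants to bags of tuples.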

CQL: Relation-to-Stream Operators
• Insert stream: $\mathrm{Istream}(R) = \bigcup_{t \ge t_0} \big( (R(t) - R(t-1)) \times \{t\} \big)$
• Delete stream: $\mathrm{Dstream}(R) = \bigcup_{t > t_0} \big( (R(t-1) - R(t)) \times \{t\} \big)$
• Relation stream: $\mathrm{Rstream}(R) = \bigcup_{t \ge t_0} \big( R(t) \times \{t\} \big)$
• SELECT Istream(..), SELECT Dstream(..), SELECT Rstream(..)
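The three operators can be read as simple bag computations over consecutive states of a relation. The Python sketch below is illustrative only: it models a relation as a dict from time instants to lists of tuples, uses `collections.Counter` for bag difference, and assumes that the state before the first instant is empty. All function names are hypothetical.

```python
from collections import Counter


def bag_minus(a, b):
    """Bag difference a - b, respecting duplicates."""
    return list((Counter(a) - Counter(b)).elements())


def istream(R, times):
    """Tuples inserted at each t: (R(t) - R(t-1)) x {t}, with an empty state before t0."""
    out, prev = [], []
    for t in times:
        out += [(s, t) for s in bag_minus(R[t], prev)]
        prev = R[t]
    return out


def dstream(R, times):
    """Tuples deleted at each t > t0: (R(t-1) - R(t)) x {t}."""
    out = []
    for before, t in zip(times, times[1:]):
        out += [(s, t) for s in bag_minus(R[before], R[t])]
    return out


def rstream(R, times):
    """The full relation state at each t: R(t) x {t}."""
    return [(s, t) for t in times for s in R[t]]


# A relation that gains "b" at time 1 and loses "a" at time 2.
R = {0: ["a"], 1: ["a", "b"], 2: ["b"]}
times = [0, 1, 2]
print(istream(R, times))
print(dstream(R, times))
print(rstream(R, times))
```

Istream and Dstream report only the changes between consecutive states, whereas Rstream repeats the entire relation at every instant and is therefore usually the most verbose of the three.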

CQL: Example Queries
  Trades(time, symbol, price, volume)
  NYSE_Trades(time, symbol, price, volume)
  SWX_Trades(time, symbol, price, volume)
• Streaming Filter
    SELECT Istream(*)
    FROM Trades[RANGE Unbounded]
    WHERE price > 20
• Streaming Aggregation
    SELECT Istream(Count(*))
    FROM Trades[PARTITION BY symbol RANGE 10 Minutes SLIDE 1 Minute]
• Sliding-window Join
    SELECT Istream(*)
    FROM NYSE_Trades[RANGE 10 Minutes], SWX_Trades[RANGE 10 Minutes]
    WHERE NYSE_Trades.symbol = SWX_Trades.symbol
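For the streaming filter, an unbounded window only ever grows, so the Istream of the filtered relation consists of exactly the newly arrived tuples that satisfy the predicate. Under that reading, the query behaves like a per-tuple filter, which the following Python sketch illustrates. The function name `streaming_filter` and the sample tuples are illustrative, and this is not how the STREAM system itself evaluates the query.

```python
def streaming_filter(trades, min_price=20):
    """Emit each arriving trade whose price exceeds `min_price`, as soon as it arrives."""
    for trade in trades:
        _time, _symbol, price, _volume = trade
        if price > min_price:
            yield trade


# Illustrative Trades(time, symbol, price, volume) tuples.
trades = [
    ("10:00", "IBM", 20, 100), ("10:00", "INTC", 15, 200), ("10:00", "MSFT", 22, 100),
    ("10:05", "IBM", 18, 300), ("10:05", "MSFT", 21, 100),
]
for trade in streaming_filter(trades):
    print(trade)
```

Since the filter needs no state, it can run over an unbounded input without buffering anything.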
