Peter R. Pietzuch
prp@doc.ic.ac.uk
Drinking From The Fire Hose: Scalable Stream Processing Systems
Peter Pietzuch
Large-Scale Distributed Systems Group
http://lsds.doc.ic.ac.uk Cambridge MPhil – November 2014 Department of Computing prp@doc.ic.ac.uk
– Road authorities, traffic planners, emergency services, commuters
– But not everyone may access everything: privacy
– “What is the best time/route for my commute through central London between 7-8am?”
(Cambridge)
Data
Index
Queries
Working Storage
[Golab & Ozsu (SIGMOD 2003)]
[Figure: sensors emit a data stream of (id, temp, rain) tuples over time (t1, t2, t3, t4, ...). Example sensor output: id = 27182, temp = 24 C, rain = 20mm]
Window specification Special operators: Istream, Dstream, Rstream Any relational query
window
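The relation-to-stream operators named above (Istream, Dstream, Rstream) can be illustrated with a minimal sketch. It assumes the windowed relation is available as two consecutive snapshots and treats them as sets rather than bags; the class and method names are hypothetical and not taken from any particular system.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Sketch of relation-to-stream operators over two consecutive set-valued snapshots. */
final class RelationToStream {

    /** Istream: tuples present in the current snapshot but not in the previous one. */
    static <T> Set<T> istream(Set<T> previous, Set<T> current) {
        Set<T> inserted = new LinkedHashSet<>(current);
        inserted.removeAll(previous);
        return inserted;
    }

    /** Dstream: tuples present in the previous snapshot but not in the current one. */
    static <T> Set<T> dstream(Set<T> previous, Set<T> current) {
        Set<T> deleted = new LinkedHashSet<>(previous);
        deleted.removeAll(current);
        return deleted;
    }

    /** Rstream: every tuple of the current snapshot is emitted at each time step. */
    static <T> Set<T> rstream(Set<T> current) {
        return new LinkedHashSet<>(current);
    }
}
```

With a previous snapshot {1, 2, 3} and a current snapshot {2, 3, 4}, Istream emits 4, Dstream emits 1, and Rstream emits 2, 3 and 4.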
[Figure: stream of (temp, rain) tuples up to "now"]
[Figure: window over the stream of (temp, rain) tuples; label "s"]
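A time-based sliding window over the temperature readings could look like the sketch below. It assumes readings arrive in timestamp order and that the window covers a fixed range ending at the newest reading; the class name, the millisecond unit and the choice of an average as the aggregate are illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch: average temperature over a time-based sliding window. */
final class SlidingWindowAverage {

    private static final class Reading {
        final long timestamp;
        final double temp;

        Reading(long timestamp, double temp) {
            this.timestamp = timestamp;
            this.temp = temp;
        }
    }

    private final long rangeMillis;
    private final Deque<Reading> window = new ArrayDeque<>();
    private double sum = 0.0;

    SlidingWindowAverage(long rangeMillis) {
        if (rangeMillis <= 0) {
            throw new IllegalArgumentException("range must be positive");
        }
        this.rangeMillis = rangeMillis;
    }

    /**
     * Adds a reading, evicts readings that fell out of the window
     * (timestamp <= newest - range), and returns the current average.
     * Assumes timestamps are non-decreasing.
     */
    double add(long timestamp, double temp) {
        window.addLast(new Reading(timestamp, temp));
        sum += temp;
        while (window.peekFirst().timestamp <= timestamp - rangeMillis) {
            sum -= window.removeFirst().temp;
        }
        return sum / window.size();
    }
}
```

Keeping a running sum means each tuple is added once and evicted once, so the aggregate is updated incrementally instead of being recomputed over the whole window.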
Source: Golab & Ozsu 2003
Source: STREAM project
Which tuples to drop: cf. the trade-off between result correctness and resource relief
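One simple way to relieve load is to drop tuples at random. The sketch below assumes that policy; it is not the only one, and a semantic shedder that picks tuples by their effect on the result would differ. The class name, the drop probability parameter and the seed are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Sketch: random load shedding with a fixed drop probability. */
final class RandomLoadShedder {

    private final double dropProbability;
    private final Random random;

    RandomLoadShedder(double dropProbability, long seed) {
        if (dropProbability < 0.0 || dropProbability > 1.0) {
            throw new IllegalArgumentException("dropProbability must be in [0, 1]");
        }
        this.dropProbability = dropProbability;
        this.random = new Random(seed);
    }

    /** Returns the tuples that survive shedding, in their original order. */
    <T> List<T> shed(List<T> tuples) {
        List<T> kept = new ArrayList<>();
        for (T tuple : tuples) {
            // nextDouble() is in [0, 1): probability 0 keeps all, probability 1 drops all.
            if (random.nextDouble() >= dropProbability) {
                kept.add(tuple);
            }
        }
        return kept;
    }
}
```

A higher drop probability frees more resources but moves aggregate results further from the exact answer, which is the trade-off noted above.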
[Figure: queries over data from scientific instruments, traffic monitors, mobile sensing devices, RFID tags and body sensor networks]
[Figure: utilisation (0% to 100%) by date, 09/07 to 09/13. Courtesy of MSRC]
Google, USENIX OSDI’04
Sanjay Ghemawat Jeff Dean
partitioned data on distributed file system
$2 billion market revenue (2013)
[Figure: data size (GBs, TBs, PBs, EBs) against time scale (days, hours, mins, secs, millisecs); regions labelled "Existing systems", "Hard for complex algorithms" and "Hard for all algorithms"]
Berkeley, ACM SOSP’13
Imperial, ACM SIGMOD’13
Dataflow graph
[Figure: customer activity (User A rates items, e.g. "iPad"; ratings 3 and 5 shown) is turned into up-to-date recommendations (User A: recommend "iPhone"). The user-item matrix (User A, User B; Item 1, Item 2) is GBs to TBs in size.]
Matrix userItem = new Matrix();
Matrix coOcc = new Matrix();

void addRating(int user, int item, int rating) {
    userItem.setElement(user, item, rating);
    updateCoOccurrence(coOcc, userItem);
}

Vector getRecommendation(int user) {
    Vector userRow = userItem.getRow(user);
    Vector userRec = coOcc.multiply(userRow);
    return userRec;
}
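The slide code leaves Matrix, Vector and updateCoOccurrence undefined. A self-contained sketch of the same idea is given below, assuming dense int arrays in place of Matrix and Vector, ratings greater than zero, and a co-occurrence count defined as the number of users who rated both items. These choices, and the class name, are assumptions made for illustration and are not the original implementation.

```java
/** Sketch: co-occurrence recommender over dense arrays (0 means "not rated"). */
final class CoOccurrenceRecommender {

    private final int[][] userItem; // userItem[user][item] = rating
    private final int[][] coOcc;    // coOcc[i][j] = users who rated both i and j

    CoOccurrenceRecommender(int users, int items) {
        this.userItem = new int[users][items];
        this.coOcc = new int[items][items];
    }

    void addRating(int user, int item, int rating) {
        if (rating <= 0) {
            throw new IllegalArgumentException("rating must be positive");
        }
        boolean firstRating = userItem[user][item] == 0;
        userItem[user][item] = rating;
        if (!firstRating) {
            return; // re-rating an item does not change co-occurrence counts
        }
        // Incremental update: pair the new item with every item this user already rated.
        for (int other = 0; other < coOcc.length; other++) {
            if (other != item && userItem[user][other] != 0) {
                coOcc[item][other]++;
                coOcc[other][item]++;
            }
        }
    }

    /** Multiplies the co-occurrence matrix by the user's rating row. */
    int[] getRecommendation(int user) {
        int[] userRec = new int[coOcc.length];
        for (int i = 0; i < coOcc.length; i++) {
            for (int j = 0; j < coOcc.length; j++) {
                userRec[i] += coOcc[i][j] * userItem[user][j];
            }
        }
        return userRec;
    }
}
```

For example, if user 0 rates items 0 and 1 and user 1 then rates only item 0, the recommendation vector for user 1 has a non-zero score for item 1, because item 1 co-occurs with an item that user 1 rated.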
(@Partitioned, @Partial, @Global, …) Static program analysis
[Figure: window over a stream of (temp, rain) tuples]
Imperial, USENIX ATC’14
SE
[Figure: user-item matrix for User A and User B over Item 1 and Item 2]
Merge task Barrier
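The merge task and barrier can be sketched as follows, assuming the co-occurrence state is held as several partial matrices whose true value is their element-wise sum. Each task multiplies one partial matrix by the user's row, a barrier waits for all tasks, and a merge step sums the partial results. The names are hypothetical, and `ExecutorService.invokeAll` stands in for the barrier here; this is not how any specific system implements it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch: compute on partial state in parallel, then merge behind a barrier. */
final class PartialStateMerge {

    /** One task's contribution: its partial matrix times the user's rating row. */
    static int[] multiply(int[][] partial, int[] userRow) {
        int[] result = new int[partial.length];
        for (int i = 0; i < partial.length; i++) {
            for (int j = 0; j < userRow.length; j++) {
                result[i] += partial[i][j] * userRow[j];
            }
        }
        return result;
    }

    /** Merge step: element-wise sum of all partial results. */
    static int[] merge(List<int[]> partialResults) {
        int[] merged = new int[partialResults.get(0).length];
        for (int[] partial : partialResults) {
            for (int i = 0; i < merged.length; i++) {
                merged[i] += partial[i];
            }
        }
        return merged;
    }

    /** Requires at least one partial matrix. */
    static int[] recommend(List<int[][]> partials, int[] userRow) {
        ExecutorService pool = Executors.newFixedThreadPool(partials.size());
        try {
            List<Callable<int[]>> tasks = new ArrayList<>();
            for (int[][] partial : partials) {
                tasks.add(() -> multiply(partial, userRow));
            }
            // invokeAll returns only once every task has completed: the barrier.
            List<int[]> results = new ArrayList<>();
            for (Future<int[]> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return merge(results);
        } catch (Exception e) {
            throw new IllegalStateException("partial computation failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Summation works as the merge function here because matrix-vector multiplication is linear: multiplying each partial matrix and adding the results equals multiplying their sum.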
[Figure: query dataflow with operators Data Feeder, Balance Account*, Forwarder, Toll Calculator*, Toll Assessment*, Toll Collector and Sink; instance counts shown: 24, 12, 5 and 6]
<prp@doc.ic.ac.uk> http://lsds.doc.ic.ac.uk