Continuous Intelligence
Through Computation Sharing With Arcon
Paris Carbone
Senior Researcher @ RISE, Committer @ Apache Flink
<paris.carbone@ri.se>
Castor Software Days
A lot is going on in tech (deep learning, scalable processing, etc.)
Little contribution to critical real-time decision making
[Diagram: Data Science, Engineering, Business]
Continuous intelligence: a design pattern in which real-time analytics are integrated within a business operation, processing current and historical data to prescribe actions in response to events.
Source: https://www.gartner.com/en/newsroom/press-releases/2019-02-18-gartner-identifies-top-10-data-and-analytics-technolo
[Diagram: Business; events in, actions out]
Paradigm shift:
Data + Queries -> retrospective answers
lots of Data + Query -> real-time answers
The Stream Analytics Stack
High Level Models: Stream SQL, CEP…
Compute: Flink, Beam, Kafka-Streams, Apex, Storm
Storage: Kafka, Pub/Sub, Kinesis, Pravega…
24/7 applications/services have always been event-driven, e.g., using actor programming.
[Diagram: events arrive as a bit stream at a service, which applies its logic]
[Diagram: Data Stream Computing and Actor Programming side by side; labels: net socket, service, logic, state]
Declarative Program
service
High Level Models: Stream SQL, CEP…
Compute: Flink, Beam, Kafka-Streams, Apex, Storm, Spark Streaming…
Storage: Kafka, Pub/Sub, Kinesis, Pravega…
commercial deployments
Data Streams, Fault Tolerance, Window Aggregation
Calcite stream-SQL
influenced
[Diagram: Stream Processing with State; labels: Event Logs, Historic Data, Files, Applications/Services]
Domain-Specific APIs: SQL, CEP, Tables, ML
DataStream API: window, map, filter, etc.
Event Processing API: f(input, state, time)
Dataflow Engine
[Diagram labels: Automates; Data Scientists; Data Engineers]
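The f(input, state, time) shape of the event-processing layer can be illustrated with a small Python sketch. This is not Flink's API: `KeyState`, `process_event`, `run`, and the 60-unit gap threshold are hypothetical names and values chosen only to show a function that reads an event, consults per-key state, and uses event time to decide what to emit.

```python
from dataclasses import dataclass


@dataclass
class KeyState:
    """Hypothetical per-key state: events seen so far and last event time."""
    count: int = 0
    last_seen: int = 0


def process_event(event, state, time):
    """f(input, state, time): update the state in place and return outputs."""
    # Gap since the previous event for this key (None for the first event).
    gap = time - state.last_seen if state.count else None
    state.count += 1
    state.last_seen = time
    # Illustrative rule (an assumption): report keys that were silent > 60 units.
    if gap is not None and gap > 60:
        return [(event["key"], "gap", gap)]
    return []


def run(events):
    """Drive process_event over a stream, keeping one state object per key."""
    states = {}
    outputs = []
    for event in events:
        state = states.setdefault(event["key"], KeyState())
        outputs.extend(process_event(event, state, event["ts"]))
    return outputs, states


if __name__ == "__main__":
    stream = [{"key": "a", "ts": 0}, {"key": "a", "ts": 30},
              {"key": "a", "ts": 100}, {"key": "b", "ts": 100}]
    print(run(stream)[0])
```

The higher layers (window, map, filter, SQL) can be read as conveniences that generate functions of this shape, so users rarely write them by hand.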
Average Tip per Hour with Stream SQL

SELECT HOUR(r.rideTime) AS hourOfDay, AVG(f.tip) AS avgTip
FROM Rides r, Fares f
WHERE r.rideId = f.rideId
  AND NOT r.isStart
  AND f.payTime BETWEEN r.rideTime - INTERVAL '5' MINUTE AND r.rideTime
GROUP BY HOUR(r.rideTime);

Completed Taxi Rides within 120 min with Complex Event Processing

import org.apache.flink.cep.scala.CEP
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.streaming.api.windowing.time.Time

// TaxiRide and allRides (the stream of ride events) are defined elsewhere.
val completedRides = Pattern
  .begin[TaxiRide]("start").where(_.isStart)
  .next("end").where(!_.isStart)

CEP.pattern[TaxiRide](allRides, completedRides.within(Time.minutes(120)))
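To make the CEP pattern concrete, here is a minimal Python sketch of what "a start event directly followed by an end event within 120 minutes" means for a stream of rides. It is not Flink CEP: the `TaxiRide` fields, the per-`ride_id` grouping, the minute-granularity timestamps, and the inclusive 120-minute bound are all assumptions made for illustration, and the input is assumed to be ordered by event time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxiRide:
    ride_id: int
    is_start: bool
    ts_min: int  # event time in minutes (a simplification)


def completed_rides(rides, within_min=120):
    """Return (start, end) pairs where the end event directly follows the
    start event of the same ride and arrives within `within_min` minutes."""
    previous = {}  # ride_id -> last event seen for that ride
    matches = []
    for ride in rides:
        prev = previous.get(ride.ride_id)
        if (prev is not None and prev.is_start and not ride.is_start
                and ride.ts_min - prev.ts_min <= within_min):
            matches.append((prev, ride))
        previous[ride.ride_id] = ride
    return matches


if __name__ == "__main__":
    stream = [TaxiRide(1, True, 0), TaxiRide(2, True, 10),
              TaxiRide(1, False, 45), TaxiRide(2, False, 200)]
    for start, end in completed_rides(stream):
        print(start.ride_id, end.ts_min - start.ts_min)
```

Only the previous event per ride is kept, which is the small piece of state a "start then end" pattern needs; a real engine also has to expire that state once the time bound has passed.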
Source: https://www.flink-forward.org/
AthenaX - An Online Warehousing Platform (2017)
A stream SQL query optimiser and executor based on Flink.
AthenaX was released and open sourced by Uber Technologies. It is capable of scaling across hundreds of machines and processing hundreds of billions of real-time events daily.
https://eng.uber.com/athenax/
https://github.com/uber/AthenaX
[Diagram: event streams from UberEats restaurants and users feed AthenaX, which produces real-time estimations, e.g. estimated delivery]
Marketplace - Dynamic Ride Pricing with Apache Flink (2018)
https://marketplace.uber.com/ (Flink Forward 2018)
Compute location-sensitive trends in rider demand and driver availability
[Diagram: Input Streams (million events per sec) -> Geo-Sensitive Time-based Aggregations -> Output Decisions (prices)]
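A geo-sensitive, time-based aggregation can be sketched as a tumbling window keyed by a grid cell. The Python below is an illustration only, not Uber's pipeline: the 0.01-degree grid, the 60-second window, the event tuple layout, and the `demand_supply_ratio` helper are assumptions introduced here.

```python
import math
from collections import defaultdict


def cell_of(lat, lon, cell_deg=0.01):
    """Map a coordinate to a square grid cell (hypothetical geo index)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def aggregate(events, window_s=60, cell_deg=0.01):
    """Count ride requests and available drivers per (window start, cell).

    events: iterable of (ts_seconds, kind, lat, lon),
            where kind is "request" or "driver".
    """
    counts = defaultdict(lambda: {"request": 0, "driver": 0})
    for ts, kind, lat, lon in events:
        window_start = (ts // window_s) * window_s  # tumbling window
        counts[(window_start, cell_of(lat, lon, cell_deg))][kind] += 1
    return dict(counts)


def demand_supply_ratio(cell_counts):
    """Illustrative decision signal: requests per available driver."""
    return cell_counts["request"] / max(cell_counts["driver"], 1)


if __name__ == "__main__":
    stream = [(5, "request", 59.3345, 18.0635), (20, "request", 59.3349, 18.0631),
              (30, "driver", 59.3341, 18.0639), (70, "request", 59.3345, 18.0635)]
    for key, value in sorted(aggregate(stream).items()):
        print(key, value, demand_supply_ratio(value))
```

Keying by (window, cell) is what lets such an aggregation be partitioned across many workers; the pricing decision itself would sit downstream of these counts.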
too many too few
Flink Pipeline
Data Processing
Data Streams
but what about deeper analytics…
[Diagram: relational operators ⋈, σθ, π]
Feature Learning, Tensor Programming, Dynamic Graphs
Simulation tasks, Reasoning, Feature Engineering, Model Serving
streaming: live data -> Event Logic
batch: historic data -> ML Historic Model
Framework/Library Silos
Fragmented Codebases/Runtimes
Unshared Hardware
Over-materialization of results
Ridiculously Unoptimised Programs
No continuous intelligence
features, aggregates, ETL
model serving
?
critical decision making
Live Model
Arcon
Unified Declarative Programming: Tensors, DataFrames, DataStreams, Graphs
Optimise and Generate Code, Cross-Compile
Shared Native Execution
Unified Analytics DSL -> Arc IR (Intermediate Representation) -> Arcon Runtime
Arc IR
Translation Core DSL …
Data Streams, Linear Algebra, Relational Algebra (σθ, π, ⋈)
[Chart axes: #Frameworks, Performance]
costs ( )
is possible, e.g. resource sharing
f1, f2, f3, each with its own IR
f1 + f2 + f3 on a single IR
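One way to picture computation sharing across f1, f2, and f3 is common-subexpression reuse once all three are expressed in the same IR. The Python sketch below uses a toy nested-tuple IR and memoisation as a stand-in; it is not Arc's IR or optimiser, and the operator names (`source`, `filter_pos`, `square`, `sum`, `count`, `max`) are hypothetical.

```python
def evaluate(expr, data, cache, stats):
    """Evaluate a toy IR node, reusing any result already in `cache`."""
    if expr in cache:
        return cache[expr]
    stats["evaluated"] += 1
    op = expr[0]
    if op == "source":
        result = tuple(data)
    else:
        child = evaluate(expr[1], data, cache, stats)
        if op == "filter_pos":
            result = tuple(x for x in child if x > 0)
        elif op == "square":
            result = tuple(x * x for x in child)
        elif op == "sum":
            result = sum(child)
        elif op == "count":
            result = len(child)
        elif op == "max":
            result = max(child)
        else:
            raise ValueError(f"unknown op: {op}")
    cache[expr] = result
    return result


def run_isolated(programs, data):
    """Each program has its own cache, like separate frameworks and runtimes."""
    stats = {"evaluated": 0}
    results = [evaluate(p, data, {}, stats) for p in programs]
    return results, stats["evaluated"]


def run_shared(programs, data):
    """All programs share one cache, like f1 + f2 + f3 on a single IR."""
    stats = {"evaluated": 0}
    cache = {}
    results = [evaluate(p, data, cache, stats) for p in programs]
    return results, stats["evaluated"]


# Three programs over the same feature pipeline (hypothetical example).
FEATURES = ("square", ("filter_pos", ("source",)))
F1, F2, F3 = ("sum", FEATURES), ("count", FEATURES), ("max", FEATURES)

if __name__ == "__main__":
    data = [3, -1, 2, -5, 4]
    print(run_isolated([F1, F2, F3], data))
    print(run_shared([F1, F2, F3], data))
```

Both runs return the same answers; the shared run evaluates the common feature pipeline once instead of three times, which is the kind of saving a single IR makes visible to an optimiser.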
Arcon: Unified Analytics DSL -> Arc (High Level IR) -> Logical Dataflow IR -> Physical Dataflow IR -> Binaries
Read More
[Paper] Arc: An IR for Batch and Stream Programming @ DBPL19
[Code] https://github.com/cda-group/arc
[Chart: Execution Time (seconds), log scale 10^0 to 10^3; workload: 10M elements, 50 map operations]
Two orders of magnitude faster
Arc (High Level IR) -> Logical Dataflow IR -> Physical Dataflow IR -> Binaries
Arc can boost even existing frameworks
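The benchmark workload (a chain of 50 map operations) is a classic target for operator fusion. The Python sketch below shows the idea only: `run_unfused`, `fuse`, and `run_fused` are hypothetical helpers, not Arc's code generator, and the sketch makes no claim about the speedup reported on the slide.

```python
def run_unfused(data, fns):
    """Apply each map separately: one pass and one intermediate list per map."""
    passes = 0
    for fn in fns:
        data = [fn(x) for x in data]
        passes += 1
    return data, passes


def fuse(fns):
    """Compose a chain of element-wise functions into a single function."""
    def fused(x):
        for fn in fns:
            x = fn(x)
        return x
    return fused


def run_fused(data, fns):
    """Apply the fused function: a single pass, no intermediate lists."""
    fused = fuse(fns)
    return [fused(x) for x in data], 1


# 50 hypothetical map operations, each adding a different constant.
MAPS = [lambda x, k=k: x + k for k in range(50)]

if __name__ == "__main__":
    values = list(range(10))
    print(run_unfused(values, MAPS)[1], run_fused(values, MAPS)[1])
```

In Python the fused version mainly avoids materialising 49 intermediate lists; a compiler working on an IR can go further and inline the whole chain into one loop body.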
Neptune: Scheduling Suspendable Tasks for Unified Stream/Batch Applications. Garefalakis, Karanasos, Pietzuch. SoCC 2019.
Operational Plane: Appmaster, Statemaster (dataflow deployment, control data, snapshots …)
Execution Plane: workers, each with a Dynamic Scheduler and IO (Channels / State)
[Diagram: systems placed between Static and Dynamic: Hadoop, Spark, Flink, Arcon, Neptune, Ray, Storm]
Flexible State Backends (external/shared, embedded)
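The idea of suspendable tasks under a dynamic scheduler can be sketched with Python generators, assuming coroutine-style suspension points. This is an illustration, not Neptune's or Arcon's scheduler: `batch_task`, `schedule`, the step-based arrival map, and the "stream tasks always win" policy are assumptions made here.

```python
from collections import deque


def batch_task(name, chunks, log):
    """A suspendable batch task: yields control after every chunk of work."""
    for i in range(chunks):
        log.append(f"{name}:chunk{i}")
        yield  # suspension point


def schedule(batch, stream_arrivals):
    """Run one batch task, letting stream tasks preempt it at suspension points.

    stream_arrivals maps a scheduler step to the callables (latency-sensitive
    stream tasks) that arrive at that step.
    """
    pending = deque()
    step = 0
    batch_done = False
    last_arrival = max(stream_arrivals, default=-1)
    while not batch_done or pending or step <= last_arrival:
        pending.extend(stream_arrivals.get(step, []))
        if pending:
            pending.popleft()()   # stream task runs first
        elif not batch_done:
            try:
                next(batch)       # resume the batch task until its next yield
            except StopIteration:
                batch_done = True
        step += 1


if __name__ == "__main__":
    log = []
    schedule(batch_task("batch", 3, log), {1: [lambda: log.append("stream:done")]})
    print(log)
```

Because the batch task gives up the worker after each chunk, a stream task that arrives mid-way waits for at most one chunk rather than for the whole batch job.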
Project: https://cda-group.github.io
Code: https://github.com/cda-group/arcon https://github.com/cda-group/arc
https://twitter.com/SenorCarbone