Apache Flink
Fast and Reliable Large-Scale Data Processing
Fabian Hueske @fhueske
What is Apache Flink?
– Distributed data flow processing system
– Focused on large-scale data analytics
– Real-time stream and batch processing
– Easy and …
The Flink stack:
– Libraries: Table API, Gelly library (graphs), ML library, Apache SAMOA, Apache MRQL, Dataflow
– Flink Core: Optimizer, DataSet API (Java/Scala), DataStream API (Java/Scala), Stream Builder, Runtime
– Environments: Local, Cluster, YARN, Apache Tez, Embedded
– Data Sources: HDFS, S3, JDBC, HCatalog, Apache HBase, Apache Kafka, Apache Flume, RabbitMQ, Hadoop IO, ...
[Chart: # of unique git committers (w/o manual de-dup), Nov. 2010 – Dec. 2014; y-axis from 20 up to 120]
Flink processes both live data streams and "historic" data for analytical workloads.
Programming Model & APIs
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// inputPath: path to the input file (placeholder)
DataSet<String> input = env.readTextFile(inputPath);

DataSet<String> first = input
  .filter(str -> str.contains("Apache Flink"));

DataSet<String> second = first
  .map(str -> str.toLowerCase());

second.print();
env.execute();
[Dataflow: Input → filter → First → map → Second]
– Element-wise: map, flatMap, filter, project
– Group-wise: groupBy, reduce, reduceGroup, combineGroup, mapPartition, aggregate, distinct
– Binary: join, coGroup, union, cross
– Iterations: iterate, iterateDelta
– Physical: rebalance, partitionByHash, sortPartition
– Streaming: window, windowMap, coMap, ... (a combined sketch of several operators follows below)
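To make a few of these operators concrete, here is a minimal sketch on the Scala DataSet API. The Visit/User case classes and the sample data are made up for illustration; this is not from the deck:

import org.apache.flink.api.scala._

case class Visit(userId: Int, url: String)
case class User(id: Int, name: String)

object OperatorSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val visits = env.fromElements(Visit(1, "/home"), Visit(2, "/docs"), Visit(1, "/docs"))
    val users  = env.fromElements(User(1, "Ada"), User(2, "Grace"))

    visits
      .filter(_.url == "/docs")                   // element-wise: filter
      .join(users).where("userId").equalTo("id")  // binary: join on the user id
      .map { pair => (pair._2.name, 1) }          // element-wise: map to (name, 1)
      .groupBy(0).sum(1)                          // group-wise: count visits per name
      .print()
  }
}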
DataStream API (streaming):

case class Word(word: String, frequency: Int)

val lines: DataStream[String] = env.fromSocketStream(...)

lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .window(Count.of(1000)).every(Count.of(100))  // count window of 1000 elements, sliding every 100
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()
Table API:

val orders = env.readCsvFile(…)
  .as('oId, 'oDate, 'shipPrio)
  .filter('shipPrio === 5)

val items = orders
  .join(lineitems).where('oId === 'id)
  .select('oId, 'oDate, 'shipPrio,
    'extdPrice * (Literal(1.0f) - 'discnt) as 'revenue)

val result = items
  .groupBy('oId, 'oDate, 'shipPrio)
  .select('oId, 'revenue.sum, 'oDate, 'shipPrio)
Processing Engine
[Diagram: a Flink program passes through the Client (pre-flight), the Master, and the Workers. Components shown: cost-based optimizer, type extraction stack, data serialization stack, task scheduling, recovery metadata, coordination, memory manager, out-of-core algorithms, and pipelined or blocking data transfer.]
Pipelined Data Transfer
– Flink's runtime has always done stream processing
– Operators pipeline data forward as soon as it is processed
– Some operators are blocking (such as sort)
– Evolving very quickly under heavy development
[Diagram: a join of a large and a small input executed as two pipelines. Pipeline 1 reads the small input and builds the hash table (build HT). Pipeline 2 maps the large input to an intermediate DataSet and probes the hash table (probe HT) to produce the result.]
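A minimal sketch of the build/probe pattern behind such a pipelined hash join, in plain Scala collections rather than Flink's actual runtime code:

object HashJoinSketch {
  // Pipeline 1: materialize the small side into a hash table.
  // Pipeline 2: stream the large side through it, record by record.
  def hashJoin[K, A, B](small: Seq[(K, A)], large: Iterator[(K, B)]): Iterator[(K, A, B)] = {
    val ht: Map[K, Seq[A]] = small.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
    large.flatMap { case (k, b) => ht.getOrElse(k, Seq.empty).map(a => (k, a, b)) }
  }
}

Because the probe side is an Iterator, a joined record is emitted as soon as its large-side record arrives, which is what keeps the second pipeline non-blocking.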
Memory Management and Out-of-Core Algorithms
– Problem: OutOfMemoryErrors due to data objects on the heap
– Solution: C++-style memory management
– Robust out-of-core algorithms (sketch below)
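A rough illustration of the idea (a hypothetical sketch, not Flink's actual MemorySegment API): records live serialized in a pre-allocated off-heap buffer and are compared on their binary form, so no per-record objects ever land on the heap.

import java.nio.ByteBuffer

object BinarySortSketch {
  val RecordSize = 12                  // 4-byte int key + 8-byte long payload
  val Capacity   = 1024                // records per pre-allocated buffer

  private val buffer = ByteBuffer.allocateDirect(RecordSize * Capacity)
  private var count  = 0

  def append(key: Int, payload: Long): Boolean = {
    if (count == Capacity) return false          // buffer full: an out-of-core algorithm would spill here
    buffer.putInt(count * RecordSize, key)
    buffer.putLong(count * RecordSize + 4, payload)
    count += 1
    true
  }

  // Sort an index array by the serialized key; records are never deserialized.
  def sortedIndices(): Array[Int] =
    (0 until count).toArray.sortBy(i => buffer.getInt(i * RecordSize))
}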
Native Data Flow Iterations
– Loops are not unrolled, but executed as cyclic data flows
– Bulk iterations (see the sketch after the diagrams below)
– Delta iterations
[Diagram: example graph with vertices 1–5 and edge weights between 0.1 and 0.9]
– Operators are scheduled once
– Data is fed back through a backflow channel
– Loop-invariant data is cached
[Diagram: iterative dataflow in which the initial datasets feed a join and a reduce; each intermediate result replaces the previous one over the backflow channel until the final result is produced.]
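A concrete sketch of a bulk iteration on the DataSet API (the classic Monte-Carlo pi estimation, following the pattern from Flink's documentation; iterateDelta is used analogously, with a solution set and a workset):

import org.apache.flink.api.scala._

object BulkIterationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val iterations = 10000

    val initial = env.fromElements(0)

    // The loop body is deployed once and run as a cyclic data flow,
    // not unrolled into `iterations` copies of the operators.
    val count = initial.iterate(iterations) { current =>
      current.map { i =>
        val x = Math.random()
        val y = Math.random()
        i + (if (x * x + y * y < 1) 1 else 0)
      }
    }

    val pi = count.map(c => c / iterations.toDouble * 4)
    pi.print()
  }
}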
[Chart: # of elements updated per iteration (y-axis up to ~45M) against the # of iterations (x-axis 1–46+); annotations mark 30 iterations and 61 iterations (convergence)]
Roadmap
– Task chaining
– Join algorithms
– Re-use of partitioning and sorting for later operations
– Caching for iterations
val orders = …
val lineitems = …

val filteredOrders = orders
  .filter(o => dateFormat.parse(o.shipDate).after(date))
  .filter(o => o.shipPrio > 2)

val lineitemsOfOrders = filteredOrders
  .join(lineitems)
  .where("orderId").equalTo("orderId")
  .apply((o, l) => new SelectedItem(o.orderDate, l.extdPrice))

val priceSums = lineitemsOfOrders
  .groupBy("orderDate")
  .sum("extdPrice")
[Diagram: two alternative execution plans for the program above.
Plan 1: DataSource (orders) → Filter, broadcast into the join as the build side (buildHT); DataSource (lineitem.tbl) forwarded as the probe side; Hybrid-Hash Join → Combine → Reduce (sort [0,1]).
Plan 2: both inputs hash-partitioned on [0] into the Hybrid-Hash Join (buildHT / probe); the output hash-partitioned on [0,1] with a partial sort [0,1] before the final Reduce (sort [0,1]).
The best plan depends on the relative sizes of the inputs.]