SLIDE 1

Apache Flink

Fast and Reliable Large-Scale Data Processing Fabian Hueske @fhueske

SLIDE 2

What is Apache Flink?

Distributed Data Flow Processing System

  • Focused on large-scale data analytics
  • Real-time stream and batch processing
  • Easy and powerful APIs (Java / Scala)
  • Robust execution backend

SLIDE 3

What is Flink good at?

It's a general-purpose data analytics system

  • Real-time stream processing with flexible windows
  • Complex and heavy ETL jobs
  • Analyzing huge graphs
  • Machine-learning on large data sets
  • ...

SLIDE 4

Flink in the Hadoop Ecosystem

[Stack diagram, bottom to top:]
  • Data Sources: HDFS, S3, JDBC, HCatalog, Apache HBase, Apache Kafka, Apache Flume, RabbitMQ, Hadoop IO, ...
  • Environments: Local, Cluster (YARN, Apache Tez), Embedded
  • Flink Core: Runtime, Optimizer, Stream Builder, DataSet API (Java/Scala), DataStream API (Java/Scala)
  • Libraries: Table API, Gelly Library, ML Library, Apache SAMOA, Apache MRQL, Google Dataflow

SLIDE 5

Flink in the ASF

  • Flink entered the ASF about one year ago

– 04/2014: Incubation
– 12/2014: Graduation

  • Strongly growing community

[Chart: # of unique git committers (without manual de-duplication), Nov. 2010 – Dec. 2014]

SLIDE 6

Where is Flink moving?

A "use-case complete" framework to unify batch & stream processing Flink

Data Streams

  • Kafka
  • RabbitMQ
  • ...

“Historic” data

  • HDFS
  • JDBC
  • ...

Analytical Workloads

  • ETL
  • Relational processing
  • Graph analysis
  • Machine learning
  • Streaming data analysis


Goal: Treat batch as finite stream

SLIDE 7

HOW TO USE FLINK?

Programming Model & APIs

SLIDE 8

Unified Java & Scala APIs

  • Fluent and mirrored APIs in Java and Scala
  • Table API for relational expressions
  • Batch and Streaming APIs are almost identical, with slightly different semantics in some cases

SLIDE 9

DataSets and Transformations

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<String> input = env.readTextFile(inputPath);

DataSet<String> first = input
  .filter(str -> str.contains("Apache Flink"));

DataSet<String> second = first
  .map(str -> str.toLowerCase());

second.print();
env.execute();

[Dataflow: Input → filter → First → map → Second]

SLIDE 10

Expressive Transformations

  • Element-wise

– map, flatMap, filter, project

  • Group-wise

– groupBy, reduce, reduceGroup, combineGroup, mapPartition, aggregate, distinct

  • Binary

– join, coGroup, union, cross

  • Iterations

– iterate, iterateDelta

  • Physical re-organization

– rebalance, partitionByHash, sortPartition

  • Streaming

– Window, windowMap, coMap, ...
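For illustration, a minimal sketch chaining a few of the transformations above in Flink's Scala DataSet API; the inputs and field layout are made up:

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// Hypothetical inputs: (userId, name) and (userId, url)
val users: DataSet[(Int, String)] = env.fromElements((1, "alice"), (2, "bob"))
val visits: DataSet[(Int, String)] = env.fromElements((1, "/home"), (1, "/docs"), (2, "/home"))

val visitCounts = visits
  .map(v => (v._1, 1))               // element-wise: one count per visit
  .groupBy(0).sum(1)                 // group-wise: visit count per userId
  .join(users).where(0).equalTo(0)   // binary: attach user names
  .map(j => (j._2._2, j._1._2))      // (name, #visits)

visitCounts.print()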

SLIDE 11

Rich Type System

  • Use any Java/Scala classes as a data type

– Tuples, POJOs, and case classes
– Not restricted to key-value pairs

  • Define (composite) keys directly on data types

– Expression
– Tuple position
– Selector function
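A minimal sketch of the three key styles above, assuming the Scala DataSet API; the Order type and its fields are hypothetical:

import org.apache.flink.api.scala._

case class Order(id: Long, date: String, priority: Int)

val env = ExecutionEnvironment.getExecutionEnvironment
val orders: DataSet[Order] = env.fromElements(Order(1L, "2015-03-01", 5))
val pairs: DataSet[(String, Int)] = env.fromElements(("a", 1), ("b", 2))

orders.groupBy("priority")      // expression key: field name on a case class / POJO
pairs.groupBy(0)                // tuple position key
orders.groupBy(o => o.id % 10)  // selector function: compute the key per element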

SLIDE 12

Counting Words in Batch and Stream

DataStream API (streaming):

case class Word(word: String, frequency: Int)

val lines: DataStream[String] = env.fromSocketStream(...)

lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .window(Count.of(1000)).every(Count.of(100))
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()

SLIDE 13

Table API

  • Execute SQL-like expressions on table data

– Tight integration with Java and Scala APIs
– Available for batch and streaming programs

val orders = env.readCsvFile(…)
  .as('oId, 'oDate, 'shipPrio)
  .filter('shipPrio === 5)

val items = orders
  .join(lineitems).where('oId === 'id)
  .select('oId, 'oDate, 'shipPrio,
          'extdPrice * (Literal(1.0f) - 'discnt) as 'revenue)

val result = items
  .groupBy('oId, 'oDate, 'shipPrio)
  .select('oId, 'revenue.sum, 'oDate, 'shipPrio)

SLIDE 14

Libraries are emerging

  • As part of the Apache Flink project

– Gelly: Graph processing and analysis
– Flink ML: Machine-learning pipelines and algorithms
– Libraries are built on APIs and can be mixed with them

  • Outside of Apache Flink

– Apache SAMOA (incubating)
– Apache MRQL (incubating)
– Google DataFlow translator

SLIDE 15

WHAT IS HAPPENING INSIDE?

Processing Engine

SLIDE 16

System Architecture

[Architecture diagram: a Flink program is submitted to the Client (pre-flight), which optimizes it and hands it to the Master; the Master coordinates the Workers, which exchange data via pipelined or blocking data transfer. Components shown: cost-based optimizer, type extraction stack, data serialization stack, memory manager, out-of-core algorithms, task scheduling, recovery metadata.]
SLIDE 17

Cool technology inside Flink

  • Batch and Streaming in one system
  • Memory-safe execution
  • Built-in data flow iterations
  • Cost-based data flow optimizer
  • Flexible windows on data streams
  • Type extraction and serialization utilities
  • Static code analysis on user functions
  • and much more...

SLIDE 18

STREAM AND BATCH IN ONE SYSTEM

Pipelined Data Transfer

SLIDE 19

Stream and Batch in one System

  • Most systems are either stream or batch systems
  • In the past, Flink focused on batch processing

– Flink's runtime has always done stream processing
– Operators pipeline data forward as soon as it is processed
– Some operators are blocking (such as sort)

  • Stream API and operators are recent contributions

– Evolving very quickly under heavy development

SLIDE 20

Pipelined Data Transfer

  • Pipelined data transfer has many benefits

– True stream and batch processing in one stack
– Avoids materialization of large intermediate results
– Better performance for many batch workloads

  • Flink supports blocking data transfer as well

SLIDE 21

Pipelined Data Transfer

[Diagram: a program joins a large input (after a map) with a small input. Pipelined execution runs two pipelines: Pipeline 1 reads the small input and builds the join's hash table; Pipeline 2 streams the large input through the map and probes the hash table, emitting the result without materializing the intermediate data set.]

No intermediate materialization!

SLIDE 22

MEMORY SAFE EXECUTION

Memory Management and Out-of-Core Algorithms

SLIDE 23

Memory-safe Execution

  • Challenge of JVM-based data processing systems

– OutOfMemoryErrors due to data objects on the heap

  • Flink runs complex data flows without memory tuning

– C++-style memory management
– Robust out-of-core algorithms

SLIDE 24

Managed Memory

  • Active memory management

– Workers allocate 70% of JVM memory as byte arrays
– Algorithms serialize data objects into byte arrays
– In-memory processing as long as data is small enough
– Otherwise partial destaging to disk

  • Benefits

– Safe memory bounds (no OutOfMemoryError)
– Scales to very large JVMs
– Reduced GC pressure
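Conceptually, the idea is to serialize records into pre-allocated byte segments instead of keeping them as heap objects. A toy illustration in Scala (not Flink's actual implementation; Record and the segment size are made up):

import java.nio.ByteBuffer

case class Record(key: Int, value: Double)

// One pre-allocated "memory segment"; the memory budget is fixed up front.
val segment: ByteBuffer = ByteBuffer.allocate(32 * 1024)

def write(r: Record): Boolean =
  if (segment.remaining() < 12) {
    false // segment full: an engine would now destage (spill) to disk
          // rather than allocate more heap, so no OutOfMemoryError
  } else {
    segment.putInt(r.key)       // 4 bytes
    segment.putDouble(r.value)  // 8 bytes
    true
  }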

SLIDE 25

Going out-of-core

[Chart: single-core join of 1 KB Java objects beyond memory (4 GB); blue bars are in-memory runs, orange bars (partially) out-of-core runs]

SLIDE 26

GRAPH ANALYSIS

Native Data Flow Iterations

SLIDE 27

Native Data Flow Iterations

  • Many graph and ML algorithms require iterations
  • Flink features native data flow iterations

– Loops are not unrolled
– But executed as cyclic data flows

  • Two types of iterations

– Bulk iterations
– Delta iterations

  • Performance competitive with specialized systems
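A minimal sketch of the bulk iteration API, using Flink's Scala DataSet API; the step function here is a stand-in:

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val initial: DataSet[Double] = env.fromElements(1.0)

// Ten iterations as one cyclic data flow: the step is scheduled once
// and its result fed back; the loop is not unrolled into ten jobs.
val result = initial.iterate(10) { current =>
  current.map(x => x / 2 + 1)
}
result.print()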

SLIDE 28

Iterative Data Flows

  • Flink runs iterations "natively" as cyclic data flows

– Operators are scheduled once
– Data is fed back through backflow channel
– Loop-invariant data is cached

  • Operator state is preserved across iterations!

[Diagram: cyclic data flow — the initial result and other data sets feed a join and a reduce; each iteration's intermediate result replaces the loop's input until the final result is produced.]

SLIDE 29

Delta Iterations

  • Delta iteration computes

– Delta update of solution set
– Work set for next iteration

  • Work set drives computations of next iteration

– Workload of later iterations significantly reduced
– Fast convergence

  • Applicable to certain problem domains

– Graph processing

[Chart: # of elements updated vs. # of iterations (1–46)]
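A sketch of the delta iteration API in Flink's Scala DataSet API; the toy connected-components logic and inputs are illustrative:

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// Toy connected components: (vertexId, componentId) pairs, undirected edges.
val vertices: DataSet[(Long, Long)] = env.fromElements((1L, 1L), (2L, 2L), (3L, 3L))
val edges: DataSet[(Long, Long)] = env.fromElements((1L, 2L), (2L, 1L), (2L, 3L), (3L, 2L))

// Solution set keyed on field 0; work set = vertices changed last round.
val components = vertices.iterateDelta(vertices, 100, Array(0)) { (solution, workset) =>
  val candidates = workset
    .join(edges).where(0).equalTo(0) { (v, e) => (e._2, v._2) } // push component to neighbors
    .groupBy(0).min(1)
  val updates = candidates
    .join(solution).where(0).equalTo(0)
    .filter(p => p._1._2 < p._2._2) // keep only genuine improvements
    .map(p => p._1)
  (updates, updates) // solution-set delta and work set for the next iteration
}
components.print()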

SLIDE 30

Iteration Performance

PageRank on Twitter Follower Graph

[Chart: runtime for 30 iterations and for 61 iterations (convergence)]

SLIDE 31

WHAT IS COMING NEXT?

Roadmap

SLIDE 32

Flink’s Roadmap

Mission: Unified stream and batch processing

  • Exactly-once streaming semantics with flexible state checkpointing

  • Extending the ML library
  • Extending the graph library
  • Interactive programs
  • Integration with Apache Zeppelin (incubating)
  • SQL on top of expression language
  • And much more…

SLIDE 33

tl;dr – What's worth remembering?

  • Flink is a general-purpose analytics system
  • Unifies streaming and batch processing
  • Expressive high-level APIs
  • Robust and fast execution engine

SLIDE 34

I ♥ Flink, do you? ;-)

If you find this exciting, get involved and start a discussion on Flink's mailing list, or stay tuned by subscribing to news@flink.apache.org or following @ApacheFlink on Twitter.

SLIDE 36

BACKUP

SLIDE 37

Data Flow Optimizer

  • Database-style optimizations for parallel data flows
  • Optimizes all batch programs
  • Optimizations

– Task chaining
– Join algorithms
– Re-use partitioning and sorting for later operations
– Caching for iterations

SLIDE 38

Data Flow Optimizer

val orders = …
val lineitems = …

val filteredOrders = orders
  .filter(o => dataFormat.parse(o.shipDate).after(date))
  .filter(o => o.shipPrio > 2)

val lineitemsOfOrders = filteredOrders
  .join(lineitems)
  .where("orderId").equalTo("orderId")
  .apply((o, l) => new SelectedItem(o.orderDate, l.extdPrice))

val priceSums = lineitemsOfOrders
  .groupBy("orderDate")
  .sum("extdPrice")

SLIDE 39

Data Flow Optimizer

[Diagram: two alternative execution plans for the program on the previous slide. Plan A: DataSource orders.tbl → Filter and DataSource lineitem.tbl feed a Hybrid-Hash Join (buildHT / probe) via broadcast and forward shipping, followed by Combine → Reduce with sort [0,1]. Plan B: both inputs are hash-partitioned on [0] into the Hybrid-Hash Join, followed by hash-partitioning on [0,1] with a partial sort [0,1] → Reduce with sort [0,1].]

Best plan depends on relative sizes of input files.