Architecture of Flink's Streaming Runtime (Robert Metzger, @rmetzger_, rmetzger@apache.org)


  1. Architecture of Flink's Streaming Runtime Robert Metzger @rmetzger_ rmetzger@apache.org

  2. What is stream processing?
     • Real-world data is unbounded and is pushed to systems
     • Right now, people are using the batch paradigm for stream analysis (there was no good stream processor available)
     • New systems (Flink, Kafka) embrace the streaming nature of data
     (figure: a web server pushing events into a Kafka topic, consumed by a stream processor)

  3. Flink is a stream processor with many faces (figure: the streaming dataflow runtime at the core)

  4. Flink's streaming runtime

  5. Requirements for a stream processor
     • Low latency: fast results (milliseconds)
     • High throughput: handle large data amounts (millions of events per second)
     • Exactly-once guarantees: correct results, also in failure cases
     • Programmability: intuitive APIs

  6. Pipelining: the basic building block to “keep the data moving”
     • Low latency
     • Operators push data forward
     • Data shipping as buffers, not tuple-wise
     • Natural handling of back-pressure (see the sketch below)
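  To picture buffer-wise shipping and back-pressure, here is a minimal sketch in plain Scala (illustrative only, not Flink's actual internals; PipeliningSketch, Buffer, and the queue sizes are invented for this example): records travel in small buffers over a bounded queue, so a slow consumer automatically blocks its producer.

      import java.util.concurrent.ArrayBlockingQueue

      // Sketch of pipelined, buffer-wise data shipping. Records move in
      // buffers, not one tuple at a time, and the bounded queue provides
      // natural back-pressure: when it is full, put() blocks the producer
      // until the consumer catches up.
      object PipeliningSketch {
        type Buffer = Vector[String]
        val BufferSize = 4
        val channel = new ArrayBlockingQueue[Buffer](2) // bounded channel

        def producer(records: Iterator[String]): Unit =
          records.grouped(BufferSize).foreach { batch =>
            channel.put(batch.toVector) // blocks when the consumer is slow
          }

        def consumer(): Unit =
          while (true) {
            val buffer = channel.take() // data is pushed forward, buffer-wise
            buffer.foreach(record => println(s"processed $record"))
          }
      }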

  7. Fault tolerance in streaming
     • At least once: ensure all operators see all events
       • Storm: replay the stream in the failure case
     • Exactly once: ensure that operators do not perform duplicate updates to their state
       • Flink: distributed snapshots
       • Spark: micro-batches on a batch runtime
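  The difference between the two guarantees is easiest to see with a stateful operator. The toy Scala program below (illustrative only, all names invented) replays a stream after a failure without restoring operator state, which is what plain at-least-once replay amounts to, and double-counts as a result:

      // Why replay alone is only at-least-once: operator state sees
      // replayed events twice unless it is rolled back to a snapshot.
      object ReplaySketch {
        var count = 0L // operator state: a running counter

        def process(event: String): Unit = count += 1

        def main(args: Array[String]): Unit = {
          val events = Seq("a", "b", "c")
          events.foreach(process)   // normal processing, count = 3
          events.foreach(process)   // failure + full replay, state kept
          println(count)            // 6, not 3: duplicate state updates
          // Exactly-once requires restoring the state (count) to a
          // consistent snapshot before the stream is replayed.
        }
      }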

  8. Flink's distributed snapshots
     • Lightweight approach to storing the state of all operators without pausing the execution
     • High throughput, low latency
     • Implemented using barriers flowing through the topology
     (figure: a barrier flowing through the data stream between a Kafka consumer, offset = 162, and a counting operator, value = 152; elements before the barrier are part of the snapshot, elements after it are not and are backed up until the next snapshot)
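  A conceptual sketch of the barrier mechanism, reusing the slide's own numbers (offset 162, counter value 152); this is a simplification for illustration, not Flink's implementation:

      // A barrier flows with the data. When an operator sees it, the
      // operator snapshots its state and forwards the barrier; elements
      // after the barrier belong to the next snapshot, and the pipeline
      // never pauses.
      sealed trait StreamElement
      case class Record(value: Long) extends StreamElement
      case class Barrier(checkpointId: Long) extends StreamElement

      class CountingOperator {
        private var counter = 0L
        private var snapshots = Map.empty[Long, Long] // checkpoint id -> state

        def onElement(element: StreamElement): Unit = element match {
          case Record(_) =>
            counter += 1 // before the barrier: part of the snapshot
          case Barrier(id) =>
            snapshots += id -> counter // e.g. value = 152 at offset = 162
            // forward the barrier downstream and keep processing; later
            // elements are backed up for the *next* snapshot
        }
      }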

  9.-12. (image-only slides)

  13. Best of all worlds for streaming
     • Low latency, thanks to the pipelined engine
     • Exactly-once guarantees via distributed snapshots
     • High throughput with controllable checkpointing overhead
     • Separates application logic from recovery: the checkpointing interval is just a config parameter (see the sketch below)
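  That last point is concrete in the API: checkpointing is a single call on the execution environment. A minimal sketch (Scala API; enableCheckpointing takes the interval in milliseconds in current Flink releases, and details may differ in the 0.9/0.10 versions this talk covers):

      import org.apache.flink.streaming.api.scala._

      val env = StreamExecutionEnvironment.getExecutionEnvironment
      // draw a distributed snapshot every 5 seconds; recovery is handled
      // by the runtime, the application logic stays unchanged
      env.enableCheckpointing(5000)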

  14. Throughput of distributed grep (a data generator feeding a “grep” operator; 30 machines, 120 cores)
     • Flink reaches an aggregate throughput of 175 million elements per second, roughly 20x higher than Storm
     • Flink's throughput is almost the same with and without exactly-once guarantees (5 s checkpoint interval)
     • Storm reaches an aggregate throughput of 9 million elements per second without fault tolerance
     (chart bars: Flink, no fault tolerance; Flink, exactly once (5s); Storm, no fault tolerance; Storm, micro-batches)

  15. Aggregate throughput for stream record grouping (30 machines, 120 cores, involving network transfer)
     • Flink reaches an aggregate throughput of 83 million elements per second
     • With fault tolerance enabled on both systems, Flink achieves 260x higher throughput than Storm (Storm with at-least-once: 309k elements per second; Storm without fault tolerance: 8.6 million elements per second)
     (chart bars: Flink, no fault tolerance; Flink, exactly once; Storm, no fault tolerance; Storm, at least once)

  16. Latency in stream record grouping: measure the time for a record to travel from the source (data generator) to the sink (a receiver measuring throughput and latency)
     (charts, for Flink without fault tolerance, Flink with exactly-once, and Storm with at-least-once: 99th-percentile latency around 50 ms at the high end; median latency around 25-30 ms at the high end and about 1 ms at the low end)

  17. (image-only slide)

  18. Exactly-Once with YARN Chaos Monkey: validate the exactly-once guarantees with a state machine (sketch of the idea below)
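  The idea behind state-machine validation, as a hedged sketch (names and the toy transition table are invented here): every key's events must follow the legal transitions of a known state machine, so an event that is lost or duplicated after a container kill surfaces as an illegal transition.

      // Exactly-once holds only if no illegal transition is ever observed,
      // even while Chaos Monkey is killing YARN containers.
      object StateMachineCheck {
        // toy machine: Start -> Middle -> End
        val legal: Map[String, String] = Map("Start" -> "Middle", "Middle" -> "End")
        var current = Map.empty[Long, String] // key -> last observed state

        def validate(key: Long, event: String): Boolean = {
          val ok = current.get(key) match {
            case None           => event == "Start"
            case Some(previous) => legal.get(previous).contains(event)
          }
          current += key -> event
          ok // false: an event was lost or duplicated
        }
      }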

  19. “Faces” of Flink

  20. Faces of a stream processor (figure: batch processing, stream processing, machine learning at scale, and graph analysis, all on top of the streaming dataflow runtime)

  21. The Flink Stack (figure, top to bottom: specialized abstractions / APIs; core APIs; Flink core, the streaming dataflow runtime; deployment)

  22. APIs for stream and batch

      case class Word(word: String, frequency: Int)

      DataSet API (batch):

      val lines: DataSet[String] = env.readTextFile(...)
      lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
        .groupBy("word").sum("frequency")
        .print()

      DataStream API (streaming):

      val lines: DataStream[String] = env.fromSocketStream(...)
      lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
        .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
        .groupBy("word").sum("frequency")
        .print()

  23. The Flink Stack: an experimental Python API is also available. (figure: DataSet (Java/Scala) programs pass through the batch optimizer and DataStream (Java/Scala) programs through the graph builder, both producing an API-independent dataflow graph representation executed by the streaming dataflow runtime; the example plan reads lineitem.tbl and orders.tbl, applies a filter and a map, joins them with a hash-partitioned hybrid hash join, and finishes with a sorted group-reduce)

  24. Batch is a special case of streaming
     • Batch: run a bounded stream (a data set) on a stream processor
     • Form a global window over the entire data set for join or grouping operations (toy illustration below)
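  A toy illustration of that claim (plain Scala, not Flink API): grouping a bounded stream is a window that spans the whole input and fires exactly once, at end-of-stream.

      object GlobalWindowSketch {
        // A "global window" group-by-count over a bounded stream: absorb
        // every element, emit once when the bounded input is exhausted.
        def groupCount(boundedStream: Iterator[String]): Map[String, Int] = {
          var counts = Map.empty[String, Int]
          boundedStream.foreach(w => counts += w -> (counts.getOrElse(w, 0) + 1))
          counts // emitted exactly once: the batch result
        }

        def main(args: Array[String]): Unit =
          println(groupCount(Iterator("to", "be", "or", "not", "to", "be")))
      }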

  25. Batch-specific optimizations
     • Managed memory, on- and off-heap (sketch of the idea below)
       • Operators (join, sort, ...) with out-of-core support
       • Optimized serialization stack for user types
     • Cost-based optimizer
       • Job execution depends on data size
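  The managed-memory idea in one sketch (illustrative only, not Flink's actual memory-management code): records are serialized into a preallocated, possibly off-heap region, so operators work on binary data and can spill to disk when the region fills up instead of failing with an OutOfMemoryError.

      import java.nio.ByteBuffer

      object ManagedMemorySketch {
        // one fixed, preallocated off-heap segment instead of per-record
        // objects on the JVM heap
        val segment: ByteBuffer = ByteBuffer.allocateDirect(1024)

        // serialize an (Int key, Long value) record into the segment;
        // returns false when the segment is full, i.e. time for the
        // operator to sort/spill this segment (out-of-core support)
        def append(key: Int, value: Long): Boolean =
          if (segment.remaining() < 12) false // 4 bytes + 8 bytes
          else { segment.putInt(key); segment.putLong(value); true }
      }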

  26. The Flink Stack (figure, top to bottom: specialized abstractions / APIs; core APIs: DataSet (Java/Scala) and DataStream; Flink core, the streaming dataflow runtime; deployment)

  27. FlinkML: Machine Learning
     • API for ML pipelines, inspired by scikit-learn
     • Collection of packaged algorithms: SVM, multiple linear regression, optimization, ALS, ...

      val trainingData: DataSet[LabeledVector] = ...
      val testingData: DataSet[Vector] = ...

      val scaler = StandardScaler()
      val polyFeatures = PolynomialFeatures().setDegree(3)
      val mlr = MultipleLinearRegression()

      val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)

      pipeline.fit(trainingData)
      val predictions: DataSet[LabeledVector] = pipeline.predict(testingData)

  28. Gelly: Graph Processing
     • Graph API and library
     • Packaged algorithms: PageRank, SSSP, label propagation, community detection, connected components

      ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
      Graph<Long, Long, NullValue> graph = ...
      DataSet<Vertex<Long, Long>> verticesWithCommunity =
          graph.run(new LabelPropagation<Long>(30)).getVertices();
      verticesWithCommunity.print();
      env.execute();

  29. Flink Stack += Gelly, ML (figure: Gelly and ML sit on top of the DataSet (Java/Scala) API, alongside DataStream, on the streaming dataflow runtime)

  30. Integration with other systems
     • Hadoop M/R: use Hadoop input/output formats, Mapper / Reducer implementations, and Hadoop's FileSystem implementations
     • Google Dataflow: run applications written against Google's Dataflow API on premise with Flink
     • Cascading: run Cascading jobs on Flink with almost no code change, and benefit from Flink's vastly better performance than MapReduce
     • Zeppelin: interactive, web-based data exploration
     • SAMOA: machine learning on data streams
     • Storm: compatibility layer for running Storm code
       • FlinkTopologyBuilder: one-line replacement for existing jobs (hypothetical sketch after this list)
       • Wrappers for Storm spouts and bolts
       • Coming soon: exactly-once with Storm
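  A hypothetical sketch of what that "one-line replacement" could look like; only the FlinkTopologyBuilder name comes from the slide, the package path is an assumption based on the 0.9-era flink-storm-compatibility module, and MySpout / MyBolt are placeholders for a user's existing Storm classes:

      // Drop-in replacement sketch: only the builder line changes, the
      // existing spout and bolt implementations are reused as-is.
      import org.apache.flink.stormcompatibility.api.FlinkTopologyBuilder

      val builder = new FlinkTopologyBuilder() // was: new TopologyBuilder()
      builder.setSpout("source", new MySpout())  // MySpout: placeholder class
      builder.setBolt("counter", new MyBolt())   // MyBolt: placeholder class
             .shuffleGrouping("source")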

  31. Deployment options
     • Local: start Flink in your IDE / on your machine; local debugging and development using the same code as on the cluster
     • Cluster: “bare metal” standalone installation of Flink on a cluster
     • YARN: Flink on Hadoop YARN (Hadoop 2.2.0+); restarts failed containers; support for Kerberos-secured YARN/HDFS setups
     (figure: the stack of libraries and APIs over the Local, Cluster, YARN, Tez, and Embedded deployment modes)

  32. The full stack
     • Libraries and compatibility layers: Dataflow (WiP), Storm (WiP), Hadoop M/R, Zeppelin, Cascading, SAMOA, MRQL, Table, Gelly, ML
     • Core APIs: DataSet (Java/Scala), DataStream
     • Runtime: streaming dataflow runtime
     • Deployment: Local, Cluster, YARN, Tez, Embedded

  33. Closing

  34. tl;dr summary
     • Flink is a software stack consisting of a streaming runtime (low latency, high throughput, fault-tolerant exactly-once data processing) and rich APIs for batch and stream processing, with a library ecosystem and integration with many systems
     • A great community of developers and users
     • Used in production

  35. What is currently happening?
     • Features in progress: master high availability, a vastly improved monitoring GUI, watermarks / event-time processing / a windowing rework, and graduating the Streaming API out of beta
     • 0.10.0-milestone-1 is currently being voted on

  36. How do I get started?
     • Mailing lists: (news | user | dev)@flink.apache.org
     • Twitter: @ApacheFlink
     • Blogs: flink.apache.org/blog, data-artisans.com/blog/
     • IRC channel: irc.freenode.net#flink

      Start Flink on YARN in 4 commands:

      # get the hadoop2 package from the Flink download page at
      # http://flink.apache.org/downloads.html
      wget <download url>
      tar xvzf flink-0.9.1-bin-hadoop2.tgz
      cd flink-0.9.1/
      ./bin/flink run -m yarn-cluster -yn 4 ./examples/flink-java-examples-0.9.1-WordCount.jar

  37. Flink Forward: a two-day conference with free training in Berlin, Germany
     • Schedule: http://flink-forward.org/?post_type=day
     • flink.apache.org

  38. Appendix
