Abstract: Apache Beam is a unified programming model capable of expressing a wide variety of both traditional batch and complex streaming use cases. By neatly separating properties of the data from runtime characteristics, Beam enables users to easily tune requirements around completeness and latency and run the same pipeline across multiple runtime environments. In addition, Beam's model enables cutting-edge optimizations, like dynamic work rebalancing and autoscaling, giving those runtimes the ability to be highly efficient. This talk will cover the basics of Apache Beam, touch on its evolution, and describe the main concepts in its powerful programming model. We'll include detailed, concrete examples of how Beam unifies batch and streaming use cases, and show efficient execution in real-world scenarios.

Using Apache Beam for Batch, Streaming, and Everything in Between
Dan Halperin (@dhalperi)
Apache Beam PMC
Senior Software Engineer, Google

Apache Beam: Open Source Data Processing APIs
• Expresses data-parallel batch and streaming algorithms with one unified API.
• Cleanly separates data processing logic from runtime requirements.
• Supports execution on multiple distributed processing runtime environments.
• Integrates with the larger data processing ecosystem.

Announcing the First Stable Release

Apache Beam at this conference
Using Apache Beam for Batch, Streaming, and Everything in Between
• Dan Halperin @ 10:15 am
Apache Beam: Integrating the Big Data Ecosystem Up, Down, and Sideways
• Davor Bonaci and Jean-Baptiste Onofré @ 11:15 am
Concrete Big Data Use Cases Implemented with Apache Beam
• Jean-Baptiste Onofré @ 12:15 pm
Nexmark, a Unified Framework to Evaluate Big Data Processing Systems
• Ismaël Mejía and Etienne Chauchot @ 2:30 pm

Apache Beam at this conference
Apache Beam Birds of a Feather
• Wednesday, 6:30 pm - 7:30 pm
Apache Beam Hacking Time
• Time: all-day Thursday
• 2nd floor collaboration area
• (depending on interest)

This talk: Apache Beam introduction and update
Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines

The Beam Model: Asking the Right Questions
• What results are calculated?
• Where in event time are results calculated?
• When in processing time are results materialized?
• How do refinements of results relate?

[Figure panels: 1. Classic Batch; 2. Batch with Fixed Windows; 3. Sessions; 4. Streaming; 5. Streaming with Speculative + Late Data]

What is Apache Beam?
The Beam Programming Model
• What / Where / When / How
SDKs for writing Beam pipelines
• Java, Python
Beam Runners for existing distributed processing backends
• Apache Apex
• Apache Flink
• Apache Spark
• Google Cloud Dataflow
[Diagram: Beam Java, Beam Python, and other-language SDKs feed the "Beam Model: Pipeline Construction" layer; runners for Apache Apex, Apache Flink, Apache Gearpump, Apache Spark, and Google Cloud Dataflow connect it to the "Beam Model: Fn Runners" layer and to execution.]

Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines

Simple clickstream analysis pipeline
Data: JSON-encoded analytics stream from site
• {"user":"dhalperi", "page":"apache.org/feed/7", "tstamp":"2016-08-31T15:07Z", …}
[Figure: event-time axis from 3:00 to 3:25 in 5-minute ticks, showing one session, 3:04-3:25]
Desired output: Per-user session length and activity level
• dhalperi, 33 pageviews, 2016-08-31 15:04-15:25
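The desired output above can be sketched in plain Java for a single user. This is only an illustration of gap-based sessionization, not the Beam API: the class name SessionSketch, the sessionize helper, and the sample timestamps are assumptions made for this example, and the rule used here (a click closer than the gap to the previous click extends the session) is a simplification.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Plain-Java sketch of gap-based sessionization for one user; not the Beam API. */
class SessionSketch {

    /** One session: first and last click time plus the number of clicks. */
    static final class Session {
        final Instant start;
        final Instant end;
        final long pageviews;

        Session(Instant start, Instant end, long pageviews) {
            this.start = start;
            this.end = end;
            this.pageviews = pageviews;
        }

        @Override
        public String toString() {
            return pageviews + " pageviews, " + start + " to " + end;
        }
    }

    /**
     * Groups click timestamps into sessions. A click that arrives less than
     * {@code gap} after the previous click extends the current session;
     * otherwise it starts a new one.
     */
    static List<Session> sessionize(List<Instant> clicks, Duration gap) {
        List<Instant> sorted = new ArrayList<>(clicks);
        Collections.sort(sorted); // chronological order
        List<Session> sessions = new ArrayList<>();
        Instant start = null;
        Instant last = null;
        long count = 0;
        for (Instant t : sorted) {
            if (last != null && Duration.between(last, t).compareTo(gap) >= 0) {
                // Gap exceeded: close the current session and start over.
                sessions.add(new Session(start, last, count));
                start = null;
                count = 0;
            }
            if (start == null) {
                start = t;
            }
            last = t;
            count++;
        }
        if (last != null) {
            sessions.add(new Session(start, last, count));
        }
        return sessions;
    }

    public static void main(String[] args) {
        // Hypothetical click times for one user.
        List<Instant> clicks = List.of(
            Instant.parse("2016-08-31T15:04:00Z"),
            Instant.parse("2016-08-31T15:05:30Z"),
            Instant.parse("2016-08-31T15:07:00Z"),
            Instant.parse("2016-08-31T15:20:00Z"));
        for (Session s : sessionize(clicks, Duration.ofMinutes(3))) {
            System.out.println(s);
        }
    }
}
```

A real pipeline would do this per user and in parallel, and would also have to cope with clicks that arrive out of order, which this sketch sidesteps by sorting the whole input first.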

Two example applications
Streaming job consuming Kafka stream
• Uses 10 workers.
• Pipeline lag of a few seconds.
• With 2 million users over 1 day.
• Want fresh, correct results at low latency
• Okay to use more resources
Batch job consuming HDFS archive
• Uses 200 workers.
• Runs for 30 minutes.
• Same input.
• Accurate results at job completion
• Batch efficiency
What does the user have to change to get these results?
A: O(10 lines of code) + Command-line Arguments

Quick overview of the Beam model
Clean abstractions hide details
• PCollection – a parallel collection of timestamped elements that are in windows.
• Sources & Readers – produce PCollections of timestamped elements and a watermark.
• ParDo – flatmap over elements of a PCollection.
• (Co)GroupByKey – shuffle & group {{K: V}} → {K: [V]}.
• Side inputs – global view of a PCollection used for broadcast / joins.
• Window – reassign elements to zero or more windows; may be data-dependent.
• Triggers – user flow control based on window, watermark, element count, lateness.
• State & Timers – cross-element data storage and callbacks enable complex operations.
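To make the data shapes of two of these primitives concrete, here is a minimal plain-Java sketch. It is not the Beam API: the class PrimitivesSketch, its flatMap and groupByKey helpers, and the sample input (including the user "alice") are hypothetical, and everything runs in memory on a single machine with no windows or timestamps.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Plain-Java sketch of the data shapes of two primitives; not the Beam API. */
class PrimitivesSketch {

    /** ParDo-like step: each input element may produce zero or more outputs. */
    static <A, B> List<B> flatMap(List<A> input, Function<A, List<B>> fn) {
        List<B> out = new ArrayList<>();
        for (A element : input) {
            out.addAll(fn.apply(element));
        }
        return out;
    }

    /** GroupByKey-like step: a bag of (K, V) pairs becomes K -> [V]. */
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Hypothetical "user page" lines; the blank line yields no output.
        List<String> lines = List.of(
            "dhalperi /feed/7", "", "dhalperi /feed/8", "alice /home");
        List<Map.Entry<String, String>> pairs = flatMap(lines, line -> {
            if (line.isEmpty()) {
                return List.<Map.Entry<String, String>>of();
            }
            String[] parts = line.split(" ");
            return List.of(Map.entry(parts[0], parts[1]));
        });
        System.out.println(groupByKey(pairs));
    }
}
```

The flatmap shape is what lets a single step both filter (return nothing) and fan out (return several elements), while the grouping step is where a distributed runner would need a shuffle.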

[Figure panels: 1. Classic Batch; 2. Batch with Fixed Windows; 3. Sessions; 4. Streaming; 5. Streaming with Speculative + Late Data]

Simple clickstream analysis pipeline
PCollection<KV<User, Click>> clickstream =
    pipeline.apply(IO.Read(…))
            .apply(MapElements.of(new ParseClicksAndAssignUser()));
PCollection<KV<User, Long>> userSessions =
    clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
                            .triggering(
                                AtWatermark()
                                    .withEarlyFirings(AtPeriod(Minutes(1)))))
               .apply(Count.perKey());
userSessions.apply(MapElements.of(new FormatSessionsForOutput()))
            .apply(IO.Write(…));
pipeline.run();
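The trigger in the pipeline above asks for speculative results every minute plus a final result at the watermark. The following plain-Java simulation is one rough way to picture that for a single window. It is a simplification and not the Beam API: the class EarlyFiringSketch, the fixed processing-time ticks, the accumulating counts, and the choice to ignore elements that arrive after the watermark are all assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified single-window simulation of early + on-time firings; not the Beam API. */
class EarlyFiringSketch {

    /** One emitted result: when it fired, the running count, and whether it is final. */
    static final class Pane {
        final long firedAtSec;
        final long count;
        final boolean onTime;

        Pane(long firedAtSec, long count, boolean onTime) {
            this.firedAtSec = firedAtSec;
            this.count = count;
            this.onTime = onTime;
        }

        @Override
        public String toString() {
            return (onTime ? "on-time" : "early") + " @" + firedAtSec + "s count=" + count;
        }
    }

    /**
     * Emits an early pane at each period tick (in processing time) if new
     * elements arrived since the last pane, then one on-time pane when the
     * watermark passes. Counts accumulate across panes.
     */
    static List<Pane> panes(long[] arrivalsSec, long periodSec, long watermarkPassesSec) {
        if (periodSec <= 0) {
            throw new IllegalArgumentException("periodSec must be positive");
        }
        long[] sorted = arrivalsSec.clone();
        Arrays.sort(sorted);
        List<Pane> out = new ArrayList<>();
        int seen = 0;      // elements that have arrived so far
        int lastFired = 0; // count reported by the previous pane
        for (long tick = periodSec; tick < watermarkPassesSec; tick += periodSec) {
            while (seen < sorted.length && sorted[seen] <= tick) {
                seen++;
            }
            if (seen > lastFired) {
                out.add(new Pane(tick, seen, false));
                lastFired = seen;
            }
        }
        while (seen < sorted.length && sorted[seen] <= watermarkPassesSec) {
            seen++;
        }
        out.add(new Pane(watermarkPassesSec, seen, true));
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical arrival times (seconds of processing time) for one window.
        long[] arrivals = {10, 70, 80, 200};
        for (Pane p : panes(arrivals, 60, 250)) {
            System.out.println(p);
        }
    }
}
```

The early panes trade completeness for latency: each one may be revised by a later pane, and only the on-time pane reflects everything that arrived before the watermark passed.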
