Radically modular data ingestion APIs in Apache Beam
Eugene Kirpichov <kirpichov@google.com>, Staff Software Engineer

Plan
01 Intro to Beam: Unified, portable data processing
02 IO - APIs for data ingestion: What's the big deal
03 Composable IO: IO as data processing
04 Splittable DoFn: Missing piece for composable sources
05 Recap: If you remember two things
01 Intro to Beam
Unified, portable data processing
(2004) MapReduce: SELECT + GROUPBY
(2008) FlumeJava: High-level API
(2013) Millwheel: Deterministic streaming
(2014) Dataflow: Batch/streaming agnostic, portable across languages & runners
(2016) Apache Beam: Open, community-driven, vendor-independent
Batch vs. streaming is moot in Beam
(batch is nearly always part of a higher-level streaming use case)
Beam PTransforms
- ParDo (good old FlatMap, runs a user DoFn)
- GroupByKey
- Composite transforms
The stack: User code → Libraries of PTransforms, IO → SDK (per language) → Runner
Read text files → Split into words → Count → Format → Write text files
Pipeline p = Pipeline.create(options);

PCollection<String> lines = p.apply(
    TextIO.read().from("gs://.../*"));

PCollection<KV<String, Long>> wordCounts = lines
    .apply(FlatMapElements
        .into(TypeDescriptors.strings())
        .via(line -> Arrays.asList(line.split("\\W+"))))
    .apply(Count.perElement());

wordCounts
    .apply(MapElements
        .into(TypeDescriptors.strings())
        .via(count -> count.getKey() + ": " + count.getValue()))
    .apply(TextIO.write().to("gs://.../..."));

p.run();
02 IO - APIs for data ingestion
What's the big deal
Beam IO
Files: Text/Avro/XML/… on HDFS, S3, GCS
Messaging: Kafka, Kinesis, AMQP, Pubsub, JMS, MQTT
Databases & more: Hive, Solr, Elasticsearch, BigQuery, Bigtable, Datastore, Spanner, Hadoop, JDBC, MongoDb, Redis, Cassandra, HBase
IO is essential
Most pipelines move data from X to Y
ETL: Extract, Transform, Load
IO is messy
[Diagram: E → T → L. The T in the middle is the cozy, pure programming model; the E and L at the boundaries are where the mess lives.]
IO is messy
- Read via CSV dump
- Clean up temp files
- Preserve filenames
- Route to different tables
- Decompress ZIP
- Dead-letter failed records
- Read tons of small files
- Skip headers
- Write A, then write B
- Write to A, then read B
- Read multiple tables in a transaction
- Stream new files
- Quotas & size limits
- Rate limiting / throttling
- …
IO is unified
(batch/streaming agnostic)
Classic batch: read files; write files
Classic streaming: read Kafka; stream to Kafka
Reality: read files + watch for new files; stream files; read Kafka from start + tail it
IO is unified
(batch/streaming agnostic)
Inputs change, outputs evolve; the pipeline keeps output = f(input).
(See https://www.infoq.com/presentations/beam-model-stream-table-theory)
IO is unforgiving
Performance: unexpected scale; throughput, latency, memory, parallelism
Correctness: any bug = data corruption; fault tolerance; exactly-once reads/writes; error handling
IO is a chance to do better
Nobody writes a paper about their IO API (MapReduce paper: 3 paragraphs; Spark, Flink, Beam papers: 0).
Requirements are too diverse to support everything out of the box.
APIs are too rigid to let users do it themselves.
"I made a bigdata programming model!"
"Cool, how does data get in and out?"
"Brb."
IO is essential, but messy and unforgiving. It begs for good abstractions.
03 Composable IO
IO as data processing
"Source" "Transform" "Sink"
A B
InputFormat / Receiver / SourceFunction / ... Configuration: Filepattern Query string Topic name …
Traditionally: ad-hoc API, at pipeline boundary
OutputFormat / Sink / SinkFunction / ... Configuration: Directory Table name Topic name …
"Source" "Transform" "Sink"
A B
Traditionally: ad-hoc API, at pipeline boundary
My filenames come on a Kafka topic. I want to know which records failed to write I want to kick off another transform after writing I have a table per client + table of clients
Narrow APIs are not hackable
IO is just another data processing task
Parse files: globs → records
Execute queries: parameters → rows
Import to database: rows → import statistics + invalid rows
IO is just another data processing task
- Composability (aka hackability)
- Unified batch/streaming
- Transparent fault tolerance
- Scalability (read 1M files = process 1M elements)
- Monitoring, debugging
- Orchestration (do X, then read / write, then do Y)
- Future features
The rest of the programming model has been getting all this for free all along. Join the party.
IO in Beam: just transforms
BigQueryIO.write():
- Write to files, then call the BigQuery import API
- Dynamic routing
- Cleanup
- Sharding to fit under API limits
- …
Pretty complex inside, but arbitrarily powerful; see the sketch below.
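For flavor, a minimal invocation of that composite transform; a sketch, assuming rows is a PCollection<TableRow> and schema is a TableSchema (the methods shown are part of Beam's BigQueryIO API):

rows.apply(BigQueryIO.writeTableRows()
    .to("project:dataset.table")   // a function or DynamicDestinations here enables dynamic routing
    .withSchema(schema)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

All the file staging, import API calls, sharding, and cleanup hide behind this one transform.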
Composability ⇒ Modularity
What can be composed, can be decomposed.
Read text file globs (Glob → Filename → String):
Expand globs + Watch for new results + Read as text file (tail -f)
Read Kafka topic (Topic → (Topic, partition) → Message):
List partitions + Watch for new results + Read partition
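A minimal sketch of the file half of this decomposition, using Beam's FileIO/TextIO building blocks (real transforms; the glob, the 30-second poll interval, and the pipeline p are assumptions for illustration):

PCollection<String> lines = p
    .apply(FileIO.match()
        .filepattern("gs://bucket/logs/*.txt")        // expand globs
        .continuously(Duration.standardSeconds(30),   // watch for new files
            Watch.Growth.never()))
    .apply(FileIO.readMatches())                      // metadata → readable files
    .apply(TextIO.readFiles());                       // read each file as text

Because each stage is an ordinary transform, any of them can be swapped out; for example, the filenames could arrive on a Kafka topic instead of from a glob.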
Read DB via CSV (Table → Glob → Row):
Invoke dump + Read text file globs + Parse CSV
Write DB via CSV (Row → Filename → Done signal):
Format CSV + Write to text files + Invoke import
Consistent import into 2 databases:
Rows → Import to DB#1 → Done signal
Rows + Done signal → Wait → Import to DB#2 → Done signal
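This orchestration can be expressed with Beam's Wait transform (org.apache.beam.sdk.transforms.Wait). A sketch, where ImportToDb1 and ImportToDb2 are hypothetical write transforms that emit a done signal:

PCollection<Void> db1Done =
    rows.apply("ImportToDb1", new ImportToDb1());   // hypothetical: writes rows, emits done signal
rows
    .apply(Wait.on(db1Done))                        // hold back rows until DB#1's import completes
    .apply("ImportToDb2", new ImportToDb2());       // hypothetical: DB#2 now sees data consistent with DB#1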
What can be composed, can be decomposed.
What this means for you
Library authors: ignore native IO APIs if possible; unify batch & streaming; decompose ruthlessly.
Users: ignore native IO APIs if possible; assemble what you need from powerful primitives.
04 Splittable DoFn
Missing piece for composable sources
Typical IO transforms
Read = Split + Read each
Write = ParDo { REST call }
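The Write side really can be that small. A sketch of a write as a plain ParDo (Record, ServiceClient, and send are hypothetical stand-ins for a real client library):

class WriteToServiceFn extends DoFn<Record, Void> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    ServiceClient.send(c.element());   // hypothetical REST call per record
  }
}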
Read Kafka topic = List partitions + Watch for new results + Read partition (Topic → (Topic, partition) → Message): infinite output per input
Read text file globs = Expand globs + Watch for new results + Read as text file (Glob → Filename → String): no parallelism within a file*
*No Shard Left Behind: Straggler-free data processing in Cloud Dataflow
What ParDo can't do
A DoFn's per-element work is an indivisible black box ⇒ it can't be infinite and can't be parallelized further.
Splittable DoFn (SDF): partial work via restrictions
Element: what work. Restriction: what part of the work.
A DoFn processes an Element; an SDF processes an (Element, Restriction) pair, and the restriction is dynamically splittable.
Design: s.apache.org/splittable-do-fn
Example restrictions
Use case                 | Element                  | Restriction
Reading splittable files | filename                 | start offset, end offset
Reading Bigtable         | (table, filter, columns) | start key, end key
Reading Kafka            | (topic, partition)       | start offset, end offset
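As a sketch of how the Kafka row of this table looks in code, modeled on Beam's SDF annotations (TopicPartition is the Kafka client class; earliestOffset is a hypothetical helper):

class ReadKafkaPartitionFn extends DoFn<TopicPartition, byte[]> {
  @GetInitialRestriction
  public OffsetRange getInitialRestriction(@Element TopicPartition tp) {
    // Tail read: the end offset is effectively infinite.
    return new OffsetRange(earliestOffset(tp), Long.MAX_VALUE);
  }
  // @ProcessElement then emits records at the offsets it manages to claim.
}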
[Diagram: splitting a restriction. One SDF call on (element, restriction) becomes several SDF calls on the same element with disjoint sub-restrictions.]
Unbounded work per element becomes a sequence of finite pieces via checkpointing.
Anatomy of an SDF
How to process 1 element? Read a text file: (String filename) → records
How to do it in parts? Read byte sub-ranges
How to describe 1 part (a restriction)? {long start, long end}
How to do this part of this element?
f = open(element); f.seek(start);
while (f.tell() < end) { yield f.readLine(); }
Dynamic splitting of restrictions
(basically work stealing)
While process(e, r) runs, the runner can say "Split!": the primary part of the restriction keeps running, and the residual part can start in parallel elsewhere.
class ReadAvroFn extends DoFn<Filename, AvroRecord> {
  @ProcessElement
  public void processElement(ProcessContext c, OffsetRange range) {
    try (AvroReader r = Avro.open(c.element())) {
      for (r.seek(range.start());
           r.currentBlockOffset() < range.end();
           r.readNextBlock()) {
        for (AvroRecord record : r.currentBlock()) {
          c.output(record);
        }
      }
    }
  }
}
The catch in the code above: range can change concurrently, because the runner may split it while processElement is running.
Concurrent splitting
Runner: avoid returning (as residual) something already done.
process() call: avoid doing something already returned.
Idea: claiming. The contract: process() claims work before doing it; only unclaimed work is split off.
Restriction trackers, Blocks and Positions
RestrictionTracker: {restriction, what part is claimed}
Block: unit of claiming (an indivisible portion of work within the restriction)
Position: identifies a block within the restriction
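To make the contract concrete, a minimal sketch of an offset-range tracker (illustrative only; the class name and trySplit shape are assumptions, and Beam's real OffsetRangeTracker is richer):

class SimpleOffsetRangeTracker {
  private final long start;    // restriction start (inclusive)
  private long end;            // restriction end (exclusive); shrinks when split
  private long lastClaimed = -1;

  SimpleOffsetRangeTracker(long start, long end) {
    this.start = start;
    this.end = end;
  }

  synchronized long start() { return start; }

  // Called by process(): claim the block at blockOffset before working on it.
  synchronized boolean tryClaim(long blockOffset) {
    if (blockOffset >= end) {
      return false;            // past the (possibly shrunk) end: stop
    }
    lastClaimed = blockOffset; // this block now belongs to the primary
    return true;
  }

  // Called by the runner: split off the unclaimed tail [splitOffset, end).
  synchronized long[] trySplit(long splitOffset) {
    if (splitOffset <= lastClaimed || splitOffset >= end) {
      return null;             // no unclaimed work there
    }
    long[] residual = new long[] {splitOffset, end};
    end = splitOffset;         // primary shrinks to [start, splitOffset)
    return residual;
  }
}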
void processElement(ProcessContext c, OffsetRangeTracker tracker) {
  try (AvroReader r = Avro.open(c.element())) {
    for (r.seek(tracker.start());
         tracker.tryClaim(r.currentBlockOffset());
         r.readNextBlock()) {
      for (AvroRecord record : r.currentBlock()) {
        c.output(record);
      }
    }
  }
}
tryClaim returns true ⇒ safe to process this block (it won't be split off); false ⇒ stop, we've hit the end of the restriction.
Role in Beam APIs
Fundamental building block for splittable work (primarily, reading data): unbounded via checkpoints, dynamically splittable.
Enables library authors to create higher-level building blocks: matching globs, reading files, reading topics, …
05 Recap
If you remember two things
Recap: Context
Batch vs. streaming is moot (including for IO).
IO is essential, messy, unforgiving.
Traditionally: special APIs, neglected, inflexible. It begs for better abstractions.
If you remember two things:
1. Composable IO = data processing
Full power of the programming model; boycott native APIs; composable = decomposable (smaller building blocks).
2. Splittable DoFn
Element: what work. Restriction: what part of the work. Enables composable IO for sources.