 
Radically modular data ingestion APIs in Apache Beam
Eugene Kirpichov <kirpichov@google.com>
Staff Software Engineer
Plan
01 Intro to Beam: Unified, portable data processing
02 IO - APIs for data ingestion: What's the big deal
03 Composable IO: IO as data processing
04 Splittable DoFn: Missing piece for composable sources
05 Recap: If you remember two things
01 Intro to Beam Unified, portable data processing
(2004) MapReduce: SELECT + GROUPBY
(2008) FlumeJava: High-level API
(2013) Millwheel: Deterministic streaming
(2014) Dataflow: Batch/streaming agnostic, portable across languages & runners
(2016) Apache Beam: Open, community-driven, vendor-independent
Google Cloud Platform
Batch vs. streaming is moot: Beam (batch is nearly always part of higher-level streaming)
Beam PTransforms: ParDo with a DoFn (good old FlatMap); GroupByKey; Composite
Layers, top to bottom: User code; Libraries of PTransforms, IO; SDK (per language); Runner
    Pipeline p = Pipeline.create(options);

    // Read text files
    PCollection<String> lines = p.apply(TextIO.read().from("gs://.../*"));

    // Split into words, then count
    PCollection<KV<String, Long>> wordCounts = lines
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("\\W+"))))
        .apply(Count.perElement());

    // Format, then write text files
    wordCounts
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> count) -> count.getKey() + ": " + count.getValue()))
        .apply(TextIO.write().to("gs://.../..."));

    p.run();
02 IO - APIs for data ingestion What's the big deal
Beam IO
Files: Text/Avro/XML/… on HDFS, S3, GCS
Hadoop, Hive, MQTT, Solr, JDBC, Elasticsearch, Kafka, MongoDb, BigQuery, Kinesis, Redis, BigTable, AMQP, Cassandra, Datastore, Pubsub, HBase, Spanner, JMS
IO is essential. Most pipelines move data from X to Y. ETL: Extract, Transform, Load
IO is messy. E, T, L. Cozy, pure programming model
IO is messy
  Read via CSV dump
  Dead-letter failed records
  Read multiple tables in tx
  Clean up temp files
  Read tons of small files
  Stream new files
  Preserve filenames
  Skip headers
  Quotas & size limits
  Route to different tables
  Write A, then write B
  Rate limiting / throttling
  Decompress ZIP
  Write to A, then read B
  …
IO is unified (batch/streaming agnostic)
Classic batch: Read files; Write files
Classic streaming: Read Kafka; Stream to Kafka
Reality: Read files + watch new files; Stream files; Read Kafka from start + tail
IO is unified (batch/streaming agnostic)
Input changes and evolves; output changes and evolves; the pipeline keeps output = f(input)
https://www.infoq.com/presentations/beam-model-stream-table=theory
IO is unforgiving
Correctness: Any bug = data corruption; Fault tolerance; Exactly-once reads/writes; Error handling
Performance: Unexpected scale; Throughput, latency, memory, parallelism
IO is a chance to do better
Nobody writes a paper about their IO API. (MapReduce paper: 3 paragraphs; Spark, Flink, Beam: 0)
"I made a bigdata programming model." "Cool, how does data get in and out?" "Brb."
Requirements too diverse to support everything out of the box
APIs too rigid to let users do it themselves
IO is essential, but messy and unforgiving. It begs for good abstractions.
03 Composable IO IO as data processing
Traditionally: ad-hoc API, at pipeline boundary
A → "Source" → "Transform" → "Sink" → B
"Source": InputFormat / Receiver / SourceFunction / ... Configuration: Filepattern, Query string, Topic name, …
"Sink": OutputFormat / Sink / SinkFunction / ... Configuration: Directory, Table name, Topic name, …
Traditionally: ad-hoc API, at pipeline boundary
A → "Source" → "Transform" → "Sink" → B
"My filenames come on a Kafka topic."
"I have a table per client + table of clients"
"I want to know which records failed to write"
"I want to kick off another transform after writing"
Narrow APIs are not hackable
IO is just another data processing task
Globs → Parse files → Records
Parameters → Execute queries → Rows
Rows → Import to database → Invalid rows, Import statistics
IO is just another data processing task
Composability (aka hackability)
Unified batch/streaming
Transparent fault tolerance
Scalability (read 1M files = process 1M elements)
Monitoring, debugging
Orchestration (do X, then read / write, then do Y)
Future features
The rest of the programming model has been getting this for free all along. Join the party.
IO in Beam: just transforms
BigQueryIO.write() (write to files, call import API): dynamic routing; cleanup; sharding to fit under API limits; … Pretty complex, but arbitrarily powerful
Composability ⇒ Modularity. What can be composed, can be decomposed.
Read text file globs: Glob → Expand globs (watch new results, tail -f) → Filename → Read as text file → String
Read Kafka topic: Topic → List partitions (watch new results) → (Topic, partition) → Read partition → Message
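The glob decomposition on this slide can be sketched in plain Java, without Beam. This is only an illustration of the idea that "read a glob" is two small steps composed: the class and method names (GlobReadSketch, expandGlob, readTextFile, readGlob) are hypothetical, java.nio stands in for a real filesystem connector, and watching for new results is left out.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: "read text file globs" decomposed into two reusable steps.
final class GlobReadSketch {
    // Step 1: Glob -> Filenames. Sorted so the result is deterministic.
    static List<Path> expandGlob(Path dir, String glob) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                files.add(p);
            }
        }
        Collections.sort(files);
        return files;
    }

    // Step 2: Filename -> Strings (one per line).
    static List<String> readTextFile(Path file) throws IOException {
        return Files.readAllLines(file);
    }

    // The composite: expand the glob, then read each file.
    static List<String> readGlob(Path dir, String glob) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Path file : expandGlob(dir, glob)) {
            lines.addAll(readTextFile(file));
        }
        return lines;
    }
}
```

Because each step is usable on its own, a caller whose filenames arrive from somewhere else can skip expandGlob and feed readTextFile directly.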
Read DB via CSV: Table → Invoke dump → Glob → Read text file globs → Parse CSV → Row
Write DB via CSV: Row → Format CSV → Write to text files → Filename → Invoke import → Done signal
Consistent import into 2 databases
Row → Import to DB#1 → Done signal
Row → Wait (for the DB#1 done signal) → Row → Import to DB#2 → Done signal
What can be composed, can be decomposed.
What this means for you
Library authors: Ignore native IO APIs if possible; Unify batch & streaming; Decompose ruthlessly
Users: Ignore native IO APIs if possible; Assemble what you need from powerful primitives
04 Splittable DoFn Missing piece for composable sources
Typical IO transforms
Read: Split → Read each
Write: ParDo { REST call }
Read Kafka topic: Topic → List partitions (watch new results) → (Topic, partition) → Read partition → Message. Infinite output per input
Read text file globs: Glob → Expand globs (watch new results) → Filename → Read as text file → String. No parallelism within file*
*No Shard Left Behind: Straggler-free data processing in Cloud Dataflow
What ParDo can't do
Per-element work in a DoFn is an indivisible black box ⇒ can't be infinite ⇒ can't be parallelized further
Splittable DoFn (SDF): Partial work via restrictions
Element: what work
Restriction: what part of the work
DoFn processes an Element; SDF processes (Element, Restriction) and is dynamically splittable
Design: s.apache.org/splittable-do-fn
Example restrictions
                            Element                     Restriction
Reading splittable files    filename                    start offset, end offset
Reading Bigtable            (table, filter, columns)    start key, end key
Reading Kafka               (topic, partition)          start offset, end offset
Splitting restriction (diagram): one (element, restriction) pair handled by an SDF becomes three (element, restriction) pairs, each handled by an SDF
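As a rough illustration of splitting one restriction into several, the sketch below cuts a half-open offset range into fixed-size pieces. Everything here is an assumption made for the example: the OffsetRange class is a stand-in written for this sketch (not Beam's class), and the chunk size and the split method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splitting one offset-range restriction into sub-ranges.
final class InitialSplitSketch {
    // Stand-in restriction type: the half-open range [start, end).
    static final class OffsetRange {
        final long start;
        final long end;

        OffsetRange(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    // Cuts the range into consecutive pieces of at most desiredSize offsets each.
    static List<OffsetRange> split(OffsetRange range, long desiredSize) {
        if (desiredSize <= 0) {
            throw new IllegalArgumentException("desiredSize must be positive");
        }
        List<OffsetRange> parts = new ArrayList<>();
        for (long s = range.start; s < range.end; s += desiredSize) {
            parts.add(new OffsetRange(s, Math.min(s + desiredSize, range.end)));
        }
        return parts;
    }
}
```

The pieces cover the original range exactly once, so processing every (element, piece) pair is equivalent to processing the original (element, restriction) pair.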
Unbounded work per element (diagram; surviving label: Finite)
Anatomy of an SDF
How to process 1 element? Read a text file: (String filename) → records
How to do it in parts? Reading byte sub-ranges
How to describe 1 part? (restriction) {long start, long end}
How to do this part of this element?
    f = open(element);
    f.seek(start);
    while (f.tell() < end) {
      yield f.readLine();
    }
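The pseudocode above can be sketched as runnable plain Java. One detail is an assumption added here, not something the slide states: so that adjacent byte ranges neither drop nor duplicate a line, a range [start, end) emits exactly the lines whose first byte lies inside it, which means skipping the partial line at a non-zero start. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: process one (element, restriction) pair, where the
// element is a filename and the restriction is the byte range [start, end).
final class RangeReadSketch {
    // Returns the lines whose first byte falls in [start, end).
    static List<String> readLines(String filename, long start, long end) throws IOException {
        List<String> out = new ArrayList<>();
        try (RandomAccessFile f = new RandomAccessFile(filename, "r")) {
            if (start > 0) {
                // Assumed convention: back up one byte and discard up to the next
                // newline, so we begin at the first line starting at or after start.
                f.seek(start - 1);
                f.readLine();
            }
            while (f.getFilePointer() < end) {
                String line = f.readLine();
                if (line == null) {
                    break; // end of file
                }
                out.add(line);
            }
        }
        return out;
    }
}
```

With this convention, reading [0, k) and [k, size) separately yields the same lines as reading [0, size) once, for any k.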
Dynamic splitting of restrictions (basically work stealing)
While process(e, r) is running, the runner says "Split!"
The restriction splits into a Primary (keeps running) and a Residual (can start in parallel as another process(e, r))
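The arithmetic of a primary/residual split can be sketched as follows, for an offset-range restriction. This is a simplified illustration under stated assumptions: the OffsetRange type is a stand-in written for this sketch, the choice to split at the midpoint of the unprocessed remainder is an arbitrary policy, and splitRemainder is a hypothetical name.

```java
// Hypothetical sketch: dividing a partly processed range into primary and residual.
final class DynamicSplitSketch {
    // Stand-in restriction type: the half-open range [start, end).
    static final class OffsetRange {
        final long start;
        final long end;

        OffsetRange(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    // Returns {primary, residual}, or null when nothing can be given away.
    // nextUnprocessed is the first offset the running call has not handled yet.
    static OffsetRange[] splitRemainder(OffsetRange r, long nextUnprocessed) {
        long from = Math.max(nextUnprocessed, r.start);
        // Assumed policy: give away the second half of the remaining work.
        long splitPos = from + (r.end - from) / 2;
        if (splitPos <= from || splitPos >= r.end) {
            return null;
        }
        return new OffsetRange[] {
            new OffsetRange(r.start, splitPos), // primary: keeps running
            new OffsetRange(splitPos, r.end)    // residual: can start in parallel
        };
    }
}
```

The primary always contains everything already processed, so the running call never has to undo work; only unstarted offsets move to the residual.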
    class ReadAvroFn extends DoFn<Filename, AvroRecord> {
      void processElement(ProcessContext c, OffsetRange range) {
        try (AvroReader r = Avro.open(c.element())) {
          for (r.seek(range.start());
               r.currentBlockOffset() < range.end();
               r.readNextBlock()) {
            for (AvroRecord record : r.currentBlock()) {
              c.output(record);
            }
          }
        }
      }
    }
    class ReadAvroFn extends DoFn<Filename, AvroRecord> {
      void processElement(ProcessContext c, OffsetRange range) {
        try (AvroReader r = Avro.open(c.element())) {
          for (r.seek(range.start());
               r.currentBlockOffset() < range.end();  // range can change concurrently
               r.readNextBlock()) {
            for (AvroRecord record : r.currentBlock()) {
              c.output(record);
            }
          }
        }
      }
    }
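One way to cope with a range that can change concurrently is to have the processing loop claim each offset through a synchronized object before producing output for it, so a split and a claim can never interleave. The sketch below illustrates that idea only; it is not a reproduction of Beam's API, and the class name and both method names are hypothetical.

```java
// Hypothetical sketch: a thread-safe tracker over the range [start, end) whose
// end may shrink when another thread splits off a residual.
final class ClaimingTrackerSketch {
    private long end;
    private long lastClaimed;

    ClaimingTrackerSketch(long start, long end) {
        this.end = end;
        this.lastClaimed = start - 1; // nothing claimed yet
    }

    // Called by the processing loop before handling the block at this offset.
    // Returns false once the offset is outside the (possibly shrunken) range.
    synchronized boolean tryClaim(long offset) {
        if (offset >= end) {
            return false;
        }
        lastClaimed = offset;
        return true;
    }

    // Called by whoever requests the split. Shrinks the range to [start, splitPos)
    // and returns the residual as {splitPos, oldEnd}, or null if splitPos is
    // already claimed or lies outside the current range.
    synchronized long[] trySplitAt(long splitPos) {
        if (splitPos <= lastClaimed || splitPos >= end) {
            return null;
        }
        long oldEnd = end;
        end = splitPos;
        return new long[] {splitPos, oldEnd};
    }
}
```

A reading loop written against this sketch would replace the bare comparison with range.end() by a tryClaim call on each block offset and stop as soon as it returns false.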