Radically modular data ingestion APIs in Apache Beam


SLIDE 1

Eugene Kirpichov <kirpichov@google.com> Staff Software Engineer

Radically modular data ingestion APIs in Apache Beam

SLIDE 2

Plan

01 Intro to Beam: Unified, portable data processing
02 IO - APIs for data ingestion: What's the big deal
03 Composable IO: IO as data processing
04 Splittable DoFn: Missing piece for composable sources
05 Recap: If you remember two things

SLIDE 3

01 Intro to Beam

Unified, portable data processing

SLIDE 4

(2004) MapReduce: SELECT + GROUPBY
(2008) FlumeJava: High-level API
(2013) MillWheel: Deterministic streaming
(2014) Dataflow: Batch/streaming agnostic, portable across languages & runners
(2016) Apache Beam: Open, community-driven, vendor-independent

SLIDE 5

Batch vs. streaming is moot

(Batch is nearly always part of higher-level streaming)

SLIDE 6

SLIDE 7

Beam PTransforms

ParDo (good old FlatMap), parameterized by a DoFn
GroupByKey
Composite transforms
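
A minimal sketch of how the two primitive transforms fit together in the Java SDK (the input PCollection<String> "words" is a placeholder, not from the slides):

  // ParDo applies a DoFn to every element; GroupByKey groups a keyed PCollection by key.
  PCollection<KV<String, Iterable<Long>>> grouped = words
      .apply(ParDo.of(new DoFn<String, KV<String, Long>>() {
        @ProcessElement
        public void process(ProcessContext c) {
          c.output(KV.of(c.element(), 1L));  // may emit zero or more outputs (FlatMap semantics)
        }
      }))
      .apply(GroupByKey.create());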

SLIDE 8

User code
Libraries of PTransforms, IO
SDK (per language)
Runner

SLIDE 9

Read text files → Split into words → Count → Format → Write text files

Pipeline p = Pipeline.create(options);

PCollection<String> lines = p.apply(
    TextIO.read().from("gs://.../*"));

PCollection<KV<String, Long>> wordCounts = lines
    .apply(FlatMapElements
        .into(TypeDescriptors.strings())
        .via(line -> Arrays.asList(line.split("\\W+"))))
    .apply(Count.perElement());

wordCounts
    .apply(MapElements
        .into(TypeDescriptors.strings())
        .via(count -> count.getKey() + ": " + count.getValue()))
    .apply(TextIO.write().to("gs://.../..."));

p.run();

SLIDE 10

02 IO - APIs for data ingestion

What's the big deal

SLIDE 11

Beam IO

Hive, Solr, Elasticsearch, BigQuery, Bigtable, Datastore, Spanner
Files: Text/Avro/XML/… on HDFS, S3, GCS
Kafka, Kinesis, AMQP, Pubsub, JMS, Hadoop, MQTT, JDBC, MongoDB, Redis, Cassandra, HBase

SLIDE 12

IO is essential

Most pipelines move data from X to Y.
ETL: Extract, Transform, Load.

SLIDE 13

IO is messy

In E-T-L, the T (Transform) lives in a cozy, pure programming model; the E (Extract) and L (Load) are where the mess is.


SLIDE 15

IO is messy

Read via CSV dump; clean up temp files; preserve filenames; route to different tables; decompress ZIP; dead-letter failed records; read tons of small files; skip headers; write A, then write B; write to A, then read B; read multiple tables in a tx; stream new files; quotas & size limits; rate limiting / throttling; …

SLIDE 16

IO is unified

(batch/streaming agnostic)

Classic batch: read files; write files
Classic streaming: read Kafka; stream to Kafka
Reality: read files + watch new files; stream files; read Kafka from start + tail it
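
A hedged sketch of the "reality" column with the Java SDK (the filepattern and poll interval are placeholders; p is a Pipeline):

  // The same TextIO transform covers "classic batch" (read what's there now) and
  // the unified case: keep watching the filepattern for new files.
  PCollection<String> lines = p.apply(
      TextIO.read()
          .from("gs://.../*")
          // Turns the bounded read into an unbounded one: poll for new files
          // every 30 seconds and never stop watching.
          .watchForNewFiles(Duration.standardSeconds(30), Watch.Growth.never()));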

SLIDE 17

IO is unified

(batch/streaming agnostic)

As the pipeline evolves and its inputs change, the system keeps output = f(input).
https://www.infoq.com/presentations/beam-model-stream-table-theory

SLIDE 18

IO is unforgiving

Performance: unexpected scale; throughput, latency, memory, parallelism
Correctness: any bug = data corruption; fault tolerance; exactly-once reads/writes; error handling

SLIDE 19

IO is a chance to do better

Nobody writes a paper about their IO API (MapReduce paper: 3 paragraphs; Spark, Flink, Beam: 0).
Requirements are too diverse to support everything out of the box.
APIs are too rigid to let users do it themselves.

"I made a bigdata programming model." "Cool, how does data get in and out?" "Brb."

SLIDE 20

IO is essential, but messy and unforgiving. It begs for good abstractions.

SLIDE 21

03 Composable IO

IO as data processing

SLIDE 22

"Source" "Transform" "Sink"

A B

InputFormat / Receiver / SourceFunction / ... Configuration: Filepattern Query string Topic name …

Traditionally: ad-hoc API, at pipeline boundary

OutputFormat / Sink / SinkFunction / ... Configuration: Directory Table name Topic name …

SLIDE 23

"Source" "Transform" "Sink"

A B

Traditionally: ad-hoc API, at pipeline boundary

My filenames come on a Kafka topic. I want to know which records failed to write I want to kick off another transform after writing I have a table per client + table of clients

Narrow APIs are not hackable

SLIDE 24

IO is just another data processing task

Parse files: Globs → Records
Execute queries: Parameters → Rows
Import to database: Rows → Import statistics + Invalid rows
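
A hedged sketch of the middle example as ordinary element-wise processing; QueryParams, DbRow, and runQuery() are hypothetical placeholders, not Beam or slide APIs:

  PCollection<DbRow> rows = queryParams.apply("Execute queries", ParDo.of(
      new DoFn<QueryParams, DbRow>() {
        @ProcessElement
        public void process(ProcessContext c) {
          // runQuery() stands in for whatever client library talks to the database;
          // each query-parameters element fans out into the rows it returns.
          for (DbRow row : runQuery(c.element())) {
            c.output(row);
          }
        }
      }));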

SLIDE 25

IO is just another data processing task

SLIDE 26

Composability (aka hackability)
Unified batch/streaming
Transparent fault tolerance
Scalability (read 1M files = process 1M elements)
Monitoring, debugging
Orchestration (do X, then read / write, then do Y)
Future features

The rest of the programming model has been getting this for free all along. Join the party.

SLIDE 27

IO in Beam: just transforms

SLIDE 28

BigQueryIO.write():

Write to files, call the import API
Dynamic routing
Cleanup
Sharding to fit under API limits
…

Pretty complex, but arbitrarily powerful.
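
For contrast, a minimal usage sketch (the table name, schema, and input PCollection<TableRow> are placeholders): to the user, all of that complexity hides behind one composite transform:

  tableRows.apply(BigQueryIO.writeTableRows()
      .to("my-project:my_dataset.my_table")
      .withSchema(schema)
      .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));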

SLIDE 29

Composability ⇒ Modularity

What can be composed, can be decomposed.


SLIDE 30

Read text file globs: Glob → Expand globs → Watch new results → Filename → Read as text file (tail -f) → String
Read Kafka topic: Topic → List partitions → Watch new results → (Topic, partition) → Read partition → Message
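
A hedged sketch of the file branch of this decomposition using the Java SDK's FileIO/TextIO building blocks (the filepattern and poll interval are placeholders):

  PCollection<String> lines = p
      .apply(FileIO.match()
          .filepattern("gs://.../*.txt")
          // "Watch new results": keep matching the pattern as new files appear.
          .continuously(Duration.standardSeconds(30), Watch.Growth.never()))
      .apply(FileIO.readMatches())   // matched filename → readable file
      .apply(TextIO.readFiles());    // readable file → lines of text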

SLIDE 31

Read DB via CSV: Table → Invoke dump → Glob → Read text file globs → Parse CSV → Row
Write DB via CSV: Row → Format CSV → Write to text files → Filename → Invoke import → Done signal

SLIDE 32

Consistent import into 2 databases: Row → Import to DB#1 → Done signal → Wait → Import to DB#2 → Done signal
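
A hedged sketch of that orchestration. Wait.on() is a real Beam transform; importToDb1() and importToDb2() are hypothetical composite transforms assumed to emit a done-signal PCollection:

  PCollection<Void> db1Done = rows.apply("Import to DB#1", importToDb1());
  rows
      .apply(Wait.on(db1Done))   // hold the elements until the DB#1 import has finished
      .apply("Import to DB#2", importToDb2());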

SLIDE 33

What can be composed, can be decomposed.

SLIDE 34

What this means for you

Library authors: Ignore native IO APIs if possible. Unify batch & streaming. Decompose ruthlessly.
Users: Ignore native IO APIs if possible. Assemble what you need from powerful primitives.

SLIDE 35

04 Splittable DoFn

Missing piece for composable sources

SLIDE 36

Typical IO transforms

Read: Split → Read each
Write: ParDo { REST call }
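
A hedged sketch of the "Write = ParDo { REST call }" shape; MyRecord and RestClient.post() are hypothetical placeholders:

  records.apply("Write via REST", ParDo.of(new DoFn<MyRecord, Void>() {
    @ProcessElement
    public void process(ProcessContext c) {
      // One REST call per element (batching, retries, and dead-lettering omitted).
      RestClient.post(c.element());
    }
  }));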

SLIDE 37

Read Kafka topic: Topic → List partitions → Watch new results → (Topic, partition) → Read partition → Message. Problem: infinite output per input.
Read text file globs: Glob → Expand globs → Watch new results → Filename → Read as text file → String. Problem: no parallelism within a file*
*No Shard Left Behind: Straggler-free data processing in Cloud Dataflow

SLIDE 38

What ParDo can't do

A DoFn's per-element work is an indivisible black box ⇒ it can't be infinite, and it can't be parallelized further.

SLIDE 39

Splittable DoFn (SDF): Partial work via restrictions

Element: what work. Restriction: what part of the work.
A DoFn processes an Element; an SDF processes an (Element, Restriction) pair, and restrictions are dynamically splittable.

Design: s.apache.org/splittable-do-fn

SLIDE 40

Example restrictions

Reading splittable files:  Element = filename                  Restriction = (start offset, end offset)
Reading Bigtable:          Element = (table, filter, columns)  Restriction = (start key, end key)
Reading Kafka:             Element = (topic, partition)        Restriction = (start offset, end offset)
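
A hedged skeleton of how the first row of this table could be declared (annotation details vary across Beam versions; Record and sizeOf() are placeholders):

  class ReadSplittableFileFn extends DoFn<String, Record> {
    @GetInitialRestriction
    public OffsetRange getInitialRestriction(String filename) {
      // The whole file is one restriction: byte range [0, file size).
      return new OffsetRange(0, sizeOf(filename));
    }

    @NewTracker
    public OffsetRangeTracker newTracker(OffsetRange range) {
      return new OffsetRangeTracker(range);
    }

    @ProcessElement
    public void processElement(ProcessContext c, OffsetRangeTracker tracker) {
      // Read records within the claimed parts of the range (see the following slides).
    }
  }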

SLIDE 41

Splitting a restriction: one (element, restriction) pair is split into several, each processed by its own SDF call.

SLIDE 42

SLIDE 43

Unbounded work per element is processed as a sequence of finite pieces (checkpoints).

SLIDE 44

Anatomy of an SDF

How to process 1 element? Read a text file: (String filename) → records
How to do it in parts? Read byte sub-ranges
How to describe 1 part (a restriction)? {long start, long end}
How to do this part of this element?
  f = open(element); f.seek(start);
  while (f.tell() < end) { yield f.readLine(); }
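
A hedged Java rendering of that pseudocode; LineReader and its open/seek/tell/readLine methods are hypothetical stand-ins for a real file API:

  void processElement(ProcessContext c, OffsetRange range) throws IOException {
    try (LineReader f = LineReader.open(c.element())) {
      f.seek(range.start());
      while (f.tell() < range.end()) {
        c.output(f.readLine());   // "yield" one record per line within the byte range
      }
    }
  }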

SLIDE 45

Dynamic splitting of restrictions

(basically work stealing)

While process(e, r) runs, the runner can ask to split the restriction: the primary part keeps running in the current call, and the residual part can start in parallel elsewhere.

SLIDE 46

class ReadAvroFn extends DoFn<Filename, AvroRecord> {
  void processElement(ProcessContext c, OffsetRange range) {
    try (AvroReader r = Avro.open(c.element())) {
      for (r.seek(range.start());
           r.currentBlockOffset() < range.end();
           r.readNextBlock()) {
        for (AvroRecord record : r.currentBlock()) {
          c.output(record);
        }
      }
    }
  }
}

SLIDE 47

Problem with the code above: range can change concurrently, because the runner may split the restriction while processElement() is running.

SLIDE 48

Concurrent splitting

Runner: avoid returning (as residual) something already done.
process() call: avoid doing something already returned.
Idea: claiming. Contract: process() claims work before doing it, and only unclaimed work is split off.

SLIDE 49

Restriction trackers, Blocks and Positions

RestrictionTracker: { restriction, what part of it is claimed }
Block: unit of claiming (indivisible portion of work within a restriction)
Position: identifies a block within a restriction
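
A simplified, hedged sketch of the claiming idea in code form (not Beam's actual OffsetRangeTracker, just an illustration of the contract above):

  class SimpleOffsetTracker {
    private final long start;
    private long end;                      // restriction is the range [start, end)
    private long lastClaimed = Long.MIN_VALUE;

    SimpleOffsetTracker(long start, long end) {
      this.start = start;
      this.end = end;
    }

    // Contract: process() claims a block's position before doing the block's work.
    synchronized boolean tryClaim(long position) {
      if (position >= end) {
        return false;                      // outside the (possibly shrunk) restriction: stop
      }
      lastClaimed = position;
      return true;                         // claimed: safe to process, won't be split off
    }

    // The runner splits off only unclaimed work: the primary keeps [start, splitPosition),
    // the residual [splitPosition, old end) can run in parallel on another worker.
    synchronized long[] trySplit(long splitPosition) {
      if (splitPosition <= lastClaimed || splitPosition <= start || splitPosition >= end) {
        return null;                       // invalid split point, or nothing unclaimed to give away
      }
      long residualEnd = end;
      end = splitPosition;                 // shrink the primary restriction
      return new long[] {splitPosition, residualEnd};
    }
  }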

SLIDE 50

void processElement(ProcessContext c, OffsetRangeTracker tracker) {
  try (AvroReader r = Avro.open(c.element())) {
    for (r.seek(tracker.start());
         tracker.tryClaim(r.currentBlockOffset());
         r.readNextBlock()) {
      for (AvroRecord record : r.currentBlock()) {
        c.output(record);
      }
    }
  }
}

tryClaim() returns true ⇒ safe to process this block (it won't be split off); false ⇒ stop, we've hit the end of the restriction.

SLIDE 51

Role in Beam APIs

Fundamental building block for splittable work (primarily, reading data): unbounded (checkpoints), dynamically splittable.
Enables library authors to create higher-level building blocks: matching globs, reading files, reading topics, …

SLIDE 52

05 Recap

If you remember two things

SLIDE 53

Recap: Context

Batch vs. streaming is moot (including IO).
IO is essential, messy, unforgiving.
Traditionally: special APIs, neglected, inflexible. Begs for better abstractions.

SLIDE 54

If you remember two things

Composable IO = data processing: full power of the programming model; boycott native APIs; composable = decomposable (smaller building blocks).
Splittable DoFn: Element = what work, Restriction = what part of the work. Enables composable IO for sources.

SLIDE 55

Q&A