No Shard Left Behind: Straggler-free data processing in Cloud Dataflow (PowerPoint PPT Presentation)


SLIDE 1

No Shard Left Behind

Straggler-free data processing in Cloud Dataflow

Eugene Kirpichov, Senior Software Engineer

SLIDE 2

Google Cloud Platform 2

(Gantt chart: workers vs. time)

SLIDE 3

SLIDE 4

Plan

01 Intro: Setting the stage
02 Stragglers: Where they come from and how people fight them
03 Dynamic rebalancing: 1. How it works 2. Why it is hard
04 Autoscaling: Why dynamic rebalancing really matters
05 If you remember two things: Philosophy of everything above

SLIDE 5

01 Intro

Setting the stage

SLIDE 6

Google’s data processing timeline

2002 → 2016: GFS, MapReduce, BigTable, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow, Apache Beam

SLIDE 7

Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 .apply(FlatMapElements.via(
     word -> Arrays.asList(word.split("[^a-zA-Z']+"))))
 .apply(Filter.byPredicate(word -> !word.isEmpty()))
 .apply(Count.perElement())
 .apply(MapElements.via(
     count -> count.getKey() + ": " + count.getValue()))
 .apply(TextIO.Write.to("gs://.../..."));
p.run();

WordCount

SLIDE 8

ParDo: DoFn: A → [B]

GroupByKey (GBK): (K, V) → (K, [V])

MapReduce = ParDo + GroupByKey + ParDo
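The equation above can be sketched in plain Java as an in-memory toy (this is not the Beam/Dataflow API; `parDo`, `groupByKey`, and `wordCount` are illustrative names):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// In-memory toy of MapReduce = ParDo + GroupByKey + ParDo.
public class MiniMapReduce {

    // ParDo: apply a DoFn A -> [B] to every input element.
    static <A, B> List<B> parDo(List<A> input, Function<A, List<B>> doFn) {
        return input.stream()
                    .flatMap(a -> doFn.apply(a).stream())
                    .collect(Collectors.toList());
    }

    // GroupByKey: (K, V) pairs -> K mapped to [V].
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new HashMap<>();
        for (Map.Entry<K, V> e : pairs)
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                   .add(e.getValue());
        return grouped;
    }

    // WordCount = ParDo (split lines, emit (word, 1)) + GBK + ParDo (sum).
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = parDo(lines, line -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String w : line.split("[^a-zA-Z']+"))
                if (!w.isEmpty()) out.add(Map.entry(w, 1));
            return out;
        });
        Map<String, Integer> counts = new HashMap<>();
        groupByKey(pairs).forEach((word, ones) -> counts.put(word, ones.size()));
        return counts;
    }
}
```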

SLIDE 9

Running a ParDo

shard 1, shard 2, …, shard N: a DoFn instance processes each shard

SLIDE 10

Gantt charts

(x axis: time; y axis: workers; one bar per shard, shard 1 … shard N)

SLIDE 11

Large WordCount: read files, GroupByKey, write files. 400 workers, 20 minutes.

SLIDE 12

02 Stragglers

Where they come from, and how people fight them

SLIDE 13

Stragglers


SLIDE 14

Amdahl’s law: it gets worse at scale

Higher scale ⇒ more bottlenecked by serial parts.

(Chart: speedup vs. #workers, one curve per serial fraction)
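Amdahl's law behind the chart: with serial fraction s, the speedup on n workers is S(n) = 1 / (s + (1 - s) / n), capped at 1/s no matter how many workers you add. A one-line sketch (`speedup` is an illustrative helper, not from any library):

```java
// Amdahl's law: S(n) = 1 / (s + (1 - s) / n), capped at 1/s.
public class Amdahl {
    static double speedup(double serialFraction, int workers) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / workers);
    }
}
```

For example, with a 10% serial fraction even 1000 workers give less than a 10x speedup, which is why higher scale is more bottlenecked by serial parts.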

SLIDE 15

Where do stragglers come from?

Uneven partitioning: e.g. process a dictionary in parallel by first letter; ⅙ of words start with ‘t’ ⇒ < 6x speedup.
Uneven complexity: join Foos / Bars in parallel by Foos; some Foos have far more Bars than others.
Uneven resources: bad machines, bad network, resource contention.
Noise: spuriously slow external RPCs, bugs.

SLIDE 16

What would you do?

Uneven partitioning / complexity: oversplit, hand-tune, use data statistics. Predictive ⇒ unreliable.
Uneven resources / noise: backups, restarts. Weak.

SLIDE 17

These kinda work. But not really.

Manual tuning = Sisyphean task: time-consuming, uninformed, obsoleted by data drift ⇒ almost always tuned wrong.
Statistics are often missing or wrong, don’t exist for intermediate data, and size != complexity.
Backups/restarts only address slow workers.

SLIDE 18

Upfront heuristics don’t work: they will predict wrong, and higher scale makes that more likely.

SLIDE 19

High scale triggers worst-case behavior.

Corollary: If you’re bottlenecked by worst-case behavior, you won’t scale.

SLIDE 20

03.1 Dynamic rebalancing

How it works

SLIDE 21

Detect and fight stragglers


SLIDE 22

What is a straggler, really?

Slower than perfectly-parallel execution: t_end > Σ t_end / N

SLIDE 23

Split stragglers, return residuals into pool of work

Example: a worker reading foo.txt [100, 200) is at position 130 but its projected completion time is far above the average. Split at 170: the worker keeps running [100, 170), and the residual [170, 200) is scheduled back into the pool (cheap, atomic).
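The foo.txt example can be sketched as a tiny helper that carves the not-yet-read remainder into a primary and a residual range (illustrative names only, not the Dataflow API):

```java
// Sketch of splitting a running straggler's shard: a worker reading
// byte range [start, end) has reached `position`; we split at a point
// ahead of the reader. The primary keeps running; the residual goes
// back into the pool of work.
public class ShardSplit {
    // Returns {primary, residual}, each as a {lo, hi} half-open range.
    static long[][] split(long start, long end, long position, long splitPos) {
        if (splitPos <= position || splitPos >= end)
            throw new IllegalArgumentException("split point must lie in (position, end)");
        return new long[][] { {start, splitPos}, {splitPos, end} };
    }
}
```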

SLIDE 24

Rinse, repeat (“liquid sharding”)


SLIDE 25

Results: savings from dynamic rebalancing: ~50% on a skewed single-ParDo job (24 workers); ~25% on a uniform ParDo/GBK/ParDo job (400 workers).

SLIDE 26

Get out of trouble > avoid trouble

Adaptive > Predictive

SLIDE 27

03.2 Dynamic rebalancing

Why is it hard?

SLIDE 28

And that’s it? What’s so hard?

Semantics: what can be split? (not just files); data consistency; APIs; wait-free splitting; perfect granularity.
Quality: non-uniform density; stuckness; “dark matter”; making predictions; measuring quality; testing consistency; debugging; being sure it works.

SLIDE 29

What is splitting

foo.txt [100, 200), split at 170 ⇒ foo.txt [100, 170) + foo.txt [170, 200)

SLIDE 30

What is splitting: Associativity

[A, B) + [B, C) = [A, C)

SLIDE 31

What is splitting: Rounding up

[A, B) = records starting in [A, B) Random access ⇒ Can split without scanning data!

SLIDE 32

What is splitting: Rounding up (example)

Dictionary: apple beet fig grape kiwi lime pear rose squash vanilla

[a, h): apple beet fig grape
[h, s): kiwi lime pear rose
[s, $): squash vanilla
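The "rounding up" rule on this slide can be sketched in a few lines, assuming the word itself stands in for the record's start position (`recordsIn` is an illustrative name): a range [A, B) owns exactly the records starting in [A, B), so every record belongs to exactly one shard and [A, B) + [B, C) = [A, C).

```java
import java.util.ArrayList;
import java.util.List;

// "Rounding up": shard [A, B) owns exactly the records whose start
// position falls in [A, B). `b == null` stands for "$" (end of range).
public class RoundingUp {
    static List<String> recordsIn(List<String> sorted, String a, String b) {
        List<String> out = new ArrayList<>();
        for (String w : sorted)
            if (w.compareTo(a) >= 0 && (b == null || w.compareTo(b) < 0))
                out.add(w);
        return out;
    }
}
```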

SLIDE 33

What is splitting: Blocks

[A, B) = records in blocks starting in [A, B)

SLIDE 34

What is splitting: Readers

“Reader”: foo.txt [100, 200), split at 170 ⇒ foo.txt [100, 170) + foo.txt [170, 200)

Re-reading consistency: continuing until EOF must yield the same records as re-reading the shard.

SLIDE 35

Dynamic splitting: readers

A shard is part read, part not yet read; splitting inside the already-read part is not ok.

X = last record read: positions must be exact and increasing. E.g. you can’t split an arbitrary SQL query.

SLIDE 36

[A, B) = blocks of records starting in [A, B)
[A, B) + [B, C) = [A, C)
Random access ⇒ no scanning needed to split
Reading is repeatable, ordered by position, positions exact

SLIDE 37

Concurrency when splitting

(Timeline: the worker alternates Read and Process steps, record after record.)

“Should I split?” While we wait for the current record to finish, 1000s of workers idle. Per-element processing in O(hours) is common!

SLIDE 38

Concurrency when splitting

(Timeline: reading and processing continue while the split request is answered immediately: “split!” “ok.”)

Split wait-free (but race-free), while processing/reading. See code: RangeTracker.
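A minimal sketch of such a tracker, modeled loosely on the RangeTracker idea the slide points at (the class name, method names, and signatures here are illustrative, not the actual Dataflow/Beam API): the reader claims positions as it goes, a concurrent splitter may shrink the range, and both operations are short synchronized sections, so neither side ever waits for an in-flight record to finish.

```java
// Minimal range tracker: the reader claims record positions; a
// concurrent splitter shrinks the range. Split succeeds only if the
// split point is still ahead of the reader, so the two never race.
public class MiniRangeTracker {
    private final long start;
    private long stop;              // exclusive; may shrink via split
    private long lastClaimed = -1;  // position of last record returned

    public MiniRangeTracker(long start, long stop) {
        this.start = start;
        this.stop = stop;
    }

    // Called by the reader before emitting the record at `position`.
    // Returns false if that position was split away into the residual.
    public synchronized boolean tryClaimRecordAt(long position) {
        if (position >= stop) return false;
        lastClaimed = position;
        return true;
    }

    // Called concurrently by the service. Succeeds only if the split
    // point lies strictly between the reader and the end of the range;
    // the primary then shrinks to [start, splitPosition).
    public synchronized boolean trySplitAtPosition(long splitPosition) {
        if (splitPosition <= lastClaimed || splitPosition >= stop) return false;
        stop = splitPosition;
        return true;
    }

    public synchronized long getStop() { return stop; }
}
```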

SLIDE 39

Perfectly granular splitting

“Few records, heavy processing” is common. ⇒ Perfect parallelism required

SLIDE 40

Separation: ParDo { record → sleep(∞) } parallelized perfectly

(requires wait-free + perfectly granular)

SLIDE 41

Separation is a qualitative improvement

ParDo: expand glob /path/to/foo*.txt ⇒ foo5.txt foo42.txt foo8.txt foo100.txt foo91.txt foo26.txt foo87.txt foo56.txt

ParDo: read records

Perfectly parallel over files, perfectly parallel over records ⇒ infinite scalability (no “shard per file”)

See also: Splittable DoFn http://s.apache.org/splittable-do-fn

SLIDE 42

“Practical” solutions improve performance. “No compromise” solutions reduce the dimension of the problem space.
SLIDE 43

SLIDE 44

Making predictions: easy, right?

Bytes: at position 130 in [100, 200): (130 - 100) / (200 - 100) = 0.3 ⇒ ~30% complete. Split at 70%: 100 + 0.7 × (200 - 100) = 170.

Keys (apple beet fig grape kiwi): at “kiwi” in [a, z): k / [a, z) ≈ 0.5 ⇒ ~50% complete. Split at 70%: 0.7 of [a, z) ≈ “t”.
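The byte-range arithmetic above as a small helper (illustrative names; key ranges would need an analogous interpolation over the key space):

```java
// Linear progress estimation over a byte range, as on the slide:
// fraction consumed = (position - start) / (stop - start);
// the position for "split at fraction f" is start + f * (stop - start).
public class ProgressEstimate {
    static double fractionConsumed(long start, long stop, long position) {
        return (double) (position - start) / (stop - start);
    }

    static long splitPositionAtFraction(long start, long stop, double fraction) {
        return start + (long) (fraction * (stop - start));
    }
}
```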

SLIDE 45

(Charts: progress vs. time for several runs, extrapolating linearly from the current point (tx, px))

Easy; usually too good to be true.

SLIDE 46

Accurate predictions are the wrong goal, and infeasible: even wildly-off estimates should leave the system working. Optimize for emergent behavior (separation). The better goal: detect stuckness.

SLIDE 47

Dark matter

Heavy work that you don’t know exists, until you hit it. Goal: discover and distribute dark matter as quickly as possible. (Image credit: NASA)

SLIDE 48

04 Autoscaling

Why dynamic rebalancing really matters

SLIDE 49

A lot of work ⇒ A lot of workers

How much work will there be? Can’t predict: data size, complexity, etc. What should you do? Adaptive > Predictive: keep re-estimating the total work and scale up/down. (Image credit: Wikipedia)
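The "keep re-estimating" loop's core decision can be sketched as follows; every name here, and the simple throughput model, is an assumption for illustration, not the Dataflow autoscaler:

```java
// Adaptive autoscaling sketch: instead of predicting the job size up
// front, repeatedly re-estimate the remaining work from observed
// progress and pick the worker count that would finish it within a
// target time, clamped to the allowed range.
public class AutoscalerSketch {
    // remainingWork: current estimate of work units left
    // perWorkerThroughput: work units per second one worker completes
    // targetSeconds: how soon we'd like the remainder to finish
    static int desiredWorkers(double remainingWork,
                              double perWorkerThroughput,
                              double targetSeconds,
                              int minWorkers, int maxWorkers) {
        int needed = (int) Math.ceil(
            remainingWork / (perWorkerThroughput * targetSeconds));
        return Math.max(minWorkers, Math.min(maxWorkers, needed));
    }
}
```

Note that this only decides how many workers to run; as the next slides argue, the extra workers are useless unless dynamic rebalancing can also produce enough pieces of work for them.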

SLIDE 50

Start off with 3 workers; things are looking okay: ~10m of estimated work. Re-estimation ⇒ orders of magnitude more work (~3 days): need 100 workers! But 100 workers are useless without 100 pieces of work; without rebalancing, 92 of them sit idle.

SLIDE 51

Autoscaling + dynamic rebalancing

Now scaling up is no big deal: add workers and the work distributes itself. A job smoothly scales 3 → 1000 workers.

(Chart: upscaling & VM startup, followed by waves of splitting)

SLIDE 52

05 If you remember two things

Philosophy of everything above

SLIDE 53

If you remember two things

1. Adaptive > Predictive: fighting stragglers > preventing stragglers; emergent behavior > local precision.
2. “No compromise” solutions matter: reducing dimension > incremental improvement; “corner cases” are clues that you’re still compromising.

wait-free, perfectly granular, separation, heavy records, reading-as-ParDo, rebalancing, autoscaling, reusability

SLIDE 54

Thank you Q&A

SLIDE 55

References

Apache Beam
No shard left behind: Dynamic work rebalancing in Cloud Dataflow
Comparing Cloud Dataflow Autoscaling to Spark and Hadoop
Splittable DoFn
Documentation on Dataflow/Beam source APIs