No Shard Left Behind: Straggler-free data processing in Cloud Dataflow
Eugene Kirpichov, Senior Software Engineer, Google
Google Cloud Platform 2
[Gantt chart: workers vs. time]
Plan
01 Intro: Setting the stage
02 Stragglers: Where they come from and how people fight them
03 Dynamic rebalancing: 1. How it works 2. Why is it hard
04 Autoscaling: Why dynamic rebalancing really matters
05 If you remember two things: Philosophy of everything above
01 Intro
Setting the stage
Google’s data processing timeline (2002–2016):
GFS, MapReduce, Bigtable, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow, Apache Beam
WordCount

Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 .apply(FlatMapElements.via(word -> Arrays.asList(word.split("[^a-zA-Z']+"))))
 .apply(Filter.byPredicate(word -> !word.isEmpty()))
 .apply(Count.perElement())
 .apply(MapElements.via(count -> count.getKey() + ": " + count.getValue()))
 .apply(TextIO.Write.to("gs://.../..."));
p.run();
ParDo: applies a DoFn: A → [B] to each element
GroupByKey (GBK): (K, V) pairs → (K, [V])
MapReduce = ParDo + GroupByKey + ParDo
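To make the decomposition concrete, here is a minimal sketch in plain Java with collections standing in for PCollections; `parDo`, `groupByKey`, and `wordCount` are hypothetical helpers I introduce for illustration, not Beam APIs:

```java
import java.util.*;
import java.util.function.Function;

// Minimal sketch of MapReduce = ParDo + GroupByKey + ParDo,
// using plain collections instead of Beam's PCollections.
public class MapReduceDecomposition {
    // ParDo: apply a DoFn A -> [B] to every element, concatenating outputs.
    static <A, B> List<B> parDo(List<A> input, Function<A, List<B>> doFn) {
        List<B> out = new ArrayList<>();
        for (A a : input) out.addAll(doFn.apply(a));
        return out;
    }

    // GroupByKey: (K, V) pairs -> K mapped to [V].
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> out = new HashMap<>();
        for (Map.Entry<K, V> e : pairs)
            out.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        return out;
    }

    // WordCount as ParDo (tokenize) + GroupByKey + ParDo (count).
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = parDo(lines, line -> {
            List<Map.Entry<String, Integer>> kvs = new ArrayList<>();
            for (String w : line.split("[^a-zA-Z']+"))
                if (!w.isEmpty()) kvs.add(Map.entry(w, 1));
            return kvs;
        });
        Map<String, Integer> counts = new HashMap<>();
        groupByKey(pairs).forEach((k, vs) -> counts.put(k, vs.size()));
        return counts;
    }
}
```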
Running a ParDo
[Diagram: input split into shard 1, shard 2, …, shard N; each shard processed by its own DoFn instance]
Gantt charts
[Gantt chart: one bar per shard, workers vs. time]
Large WordCount (read files, GroupByKey, write files): 400 workers, 20 minutes
02 Stragglers
Where they come from, and how people fight them
Stragglers
[Gantt chart: most workers finish early; a few stragglers keep the job running]
Amdahl’s law: it gets worse at scale
Higher scale ⇒ More bottlenecked by serial parts.
[Plot: speedup vs. #workers, one curve per serial fraction]
Process a dictionary in parallel by first letter: ⅙ of words start with ‘t’ ⇒ less than 6x speedup
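The arithmetic behind both bounds is easy to check; this is an illustrative sketch (class and method names are mine, not from the talk):

```java
// Why uneven partitioning caps speedup: if the largest shard holds a
// fraction f of all work, total time >= f * serialTime, so speedup <= 1/f.
// Amdahl's law is the same idea with a serial fraction s and N workers.
public class SpeedupBound {
    // Upper bound on speedup when the largest shard is fraction f of the work.
    static double maxSpeedup(double largestShardFraction) {
        return 1.0 / largestShardFraction;
    }

    // Amdahl's law: speedup(N) = 1 / (s + (1 - s) / N).
    static double amdahl(double serialFraction, int workers) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / workers);
    }
}
```

With f = 1/6 (the ‘t’ shard), `maxSpeedup` is exactly 6, and `amdahl(1.0/6, 400)` is about 5.93: the 400 workers buy almost nothing beyond 6x.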
Where do stragglers come from?
Noise: spuriously slow external RPCs, bugs
Uneven resources: bad machines, bad network, resource contention
Uneven partitioning
Uneven complexity: e.g. join Foos / Bars in parallel by Foos; some Foos have far more Bars than others
What would you do?
Against noise and uneven resources: backups, restarts (weak)
Against uneven partitioning and uneven complexity: oversplit, hand-tune, use data statistics (predictive ⇒ unreliable)
These kinda work. But not really.
Manual tuning = Sisyphean task: time-consuming, uninformed, obsoleted by data drift ⇒ almost always tuned wrong.
Statistics: often missing or wrong; don’t exist for intermediate data; size != complexity.
Backups/restarts: only address slow workers.
Upfront heuristics don’t work: will predict wrong. Higher scale → more likely.
High scale triggers worst-case behavior.
Corollary: If you’re bottlenecked by worst-case behavior, you won’t scale.
03.1 Dynamic rebalancing
How it works
Detect and fight stragglers
What is a straggler, really?
Slower than perfectly parallel: t_end > sum(t_end) / N
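The straggler test above is just a comparison against the perfectly-parallel average; a tiny sketch (names are mine):

```java
// A shard is a straggler if its (projected) end time exceeds the average
// end time sum(t_end) / N, i.e. it is slower than perfectly parallel.
public class StragglerDetector {
    static boolean isStraggler(double tEnd, double[] allEndTimes) {
        double sum = 0;
        for (double t : allEndTimes) sum += t;
        return tEnd > sum / allEndTimes.length;
    }
}
```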
Split stragglers, return residuals into pool of work
[Diagram: reader of foo.txt [100, 200) is at position 130; split at 170. The primary keeps running on [130, 170); the residual [170, 200) is scheduled as new work (cheap, atomic). Dashed lines mark “now” and the average completion time.]
Rinse, repeat (“liquid sharding”)
Workers Time
[Benchmarks: 1 ParDo over skewed data, 24 workers: ~50% improvement; ParDo/GBK/ParDo over uniform data, 400 workers: ~25% improvement]
Get out of trouble > avoid trouble
Adaptive > Predictive
03.2 Dynamic rebalancing
Why is it hard?
And that’s it? What’s so hard?
Semantics: what can be split? Data consistency; not just files; APIs.
Quality: wait-free, perfect granularity, non-uniform density, stuckness, “dark matter”, making predictions.
Testing: consistency, debugging, measuring quality, being sure it works.
What is splitting
foo.txt [100, 200) → split at 170 → foo.txt [100, 170) + foo.txt [170, 200)
What is splitting: Associativity
[A, B) + [B, C) = [A, C)
What is splitting: Rounding up
[A, B) = records starting in [A, B) Random access ⇒ Can split without scanning data!
[a, h): apple, beet, fig, grape | [h, s): kiwi, lime, pear, rose | [s, $): squash, vanilla
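The “rounding up” rule — a range owns exactly the records whose keys start inside it — can be sketched as follows (`null` stands in for ‘$’, the end of the key space; class and method names are mine):

```java
import java.util.*;

// Sketch: a key range [start, end) owns exactly the records whose keys
// fall lexicographically in [start, end). Because ranges meet end-to-start,
// [a, h) + [h, s) + [s, $) covers every record exactly once (associativity).
public class KeyRanges {
    static List<String> recordsIn(List<String> sortedKeys, String start, String end) {
        List<String> out = new ArrayList<>();
        for (String k : sortedKeys)
            if (k.compareTo(start) >= 0 && (end == null || k.compareTo(end) < 0))
                out.add(k);
        return out;
    }
}
```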
What is splitting: Blocks
[A, B) = records in blocks starting in [A, B)
What is splitting: Readers
A “Reader” reads foo.txt [100, 200); split at 170 → [100, 170) + [170, 200)
Re-reading consistency: continuing to read until EOF must yield the same records as re-reading the shard from scratch.
Dynamic splitting: readers
[Diagram: the range is divided into “read” and “not yet read”; splitting is only ok in the not-yet-read part]
X = position of the last record read must be exact and increasing; e.g. you can’t split an arbitrary SQL query.
[A, B) = blocks of records starting in [A, B)
[A, B) + [B, C) = [A, C)
Random access ⇒ no scanning needed to split
Reading repeatable, ordered by position, positions exact
Concurrency when splitting
[Timeline: the worker alternates Read and Process steps]
“Should I split?” must wait for the current record: while we wait, 1000s of workers idle. Per-element processing in O(hours) is common!
Concurrency when splitting
[Timeline: split requests (“split!” / “ok.”) are answered immediately while the worker keeps reading and processing]
Split wait-free (but race-free), while processing/reading. See code: RangeTracker.
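The idea behind the RangeTracker can be sketched with two short synchronized methods: the reader claims each record’s position before processing it, and a concurrent split request moves the end of the range. This is a simplified illustration of the concept, not Beam’s actual OffsetRangeTracker:

```java
// Simplified sketch of the RangeTracker idea: both operations are tiny
// critical sections, so the reader never blocks on a long-running split
// and the splitter never waits for record processing (which may take hours).
public class SimpleRangeTracker {
    private final long start;
    private long end;          // exclusive; may shrink due to splits
    private long lastClaimed = -1;

    SimpleRangeTracker(long start, long end) { this.start = start; this.end = end; }

    // Called by the reader for each record, before processing it. Returns
    // false once the record falls outside the (possibly shrunk) range.
    synchronized boolean tryClaimRecordAt(long position) {
        if (position >= end) return false;
        lastClaimed = position;
        return true;
    }

    // Called concurrently by the service. Succeeds only if the split point
    // is still strictly ahead of the last claimed record and inside the range.
    synchronized boolean trySplitAtPosition(long splitPosition) {
        if (splitPosition <= lastClaimed || splitPosition >= end) return false;
        end = splitPosition;   // residual [splitPosition, oldEnd) returns to the pool
        return true;
    }

    synchronized long getEnd() { return end; }
}
```

The split request returns immediately: the only mutual exclusion is per-position bookkeeping, never the processing of a record.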
Perfectly granular splitting
“Few records, heavy processing” is common. ⇒ Perfect parallelism required
Separation: ParDo { record → sleep(∞) } parallelized perfectly
(requires wait-free + perfectly granular)
Separation is a qualitative improvement
/path/to/foo*.txt
→ ParDo: expand glob (perfectly parallel over files)
→ foo5.txt, foo42.txt, foo8.txt, foo100.txt, foo91.txt, foo26.txt, foo87.txt, foo56.txt
→ ParDo: read records (perfectly parallel over records)
⇒ infinite scalability (no “shard per file”)
See also: Splittable DoFn http://s.apache.org/splittable-do-fn
“Practical” solutions improve performance. “No compromise” solutions reduce the dimension of the problem space.
Making predictions: easy, right?
Position 130 in [100, 200) ⇒ ~30% complete: (130 − 100) / (200 − 100) = 0.3. Split at 70%: 100 + 0.7 × (200 − 100) = 170.
Keys read so far: apple, beet, fig, grape, kiwi. Position ‘k’ in [a, z) ⇒ ~50% complete. Split at 70%: 0.7 × [a, z) ≈ ‘t’.
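Both estimates reduce to linear interpolation over the position space; a sketch for offset ranges (names are mine):

```java
// Progress = fraction of the position space consumed so far; the split
// point for a fraction f is the position f of the way through the range.
public class ProgressEstimate {
    static double fractionConsumed(long start, long end, long position) {
        return (double) (position - start) / (end - start);
    }

    static long positionAtFraction(long start, long end, double fraction) {
        return start + Math.round(fraction * (end - start));
    }
}
```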
[Plots: progress-vs-time curves of many different shapes]
Easy; usually too good to be true.
[Plot: linear extrapolation from current progress p_x at time t_x]
Accurate predictions = wrong goal, and infeasible.
Even wildly-off estimates ⇒ the system should still work.
Optimize for emergent behavior (separation).
Better goal: detect stuckness.
Dark matter
Heavy work that you don’t know exists, until you hit it. Goal: discover and distribute dark matter as quickly as possible. (Image credit: NASA)
04 Autoscaling
Why dynamic rebalancing really matters
A lot of work ⇒ a lot of workers
How much work will there be? Can’t predict: data size, complexity, etc. What should you do? Adaptive > Predictive: keep re-estimating total work; scale up/down. (Image credit: Wikipedia)
Start off with 3 workers; things are looking okay (estimated ~10m left). Re-estimation ⇒ orders of magnitude more work (~3 days): need 100 workers! But 100 workers are useless without 100 pieces of work: 92 of them sit idle.
Autoscaling + dynamic rebalancing
Now scaling up is no big deal: add workers, and work distributes itself. A job smoothly scales 3 → 1000 workers.
[Chart: waves of splitting during upscaling & VM startup]
05 If you remember two things
Philosophy of everything above
If you remember two things
1. Adaptive > Predictive: fighting stragglers > preventing stragglers; emergent behavior > local precision.
2. “No compromise” solutions matter: reducing dimension > incremental improvement; “corner cases” are clues that you’re still compromising.
(wait-free, perfectly granular, separation, heavy records, reading-as-ParDo, rebalancing, autoscaling, reusability)
Thank you Q&A
Further reading:
- Apache Beam
- No shard left behind: Dynamic work rebalancing in Cloud Dataflow
- Comparing Cloud Dataflow Autoscaling to Spark and Hadoop
- Splittable DoFn
- Documentation on Dataflow/Beam source APIs