SLIDE 1 Streaming Auto-Scaling in Google Cloud Dataflow
Manuel Fahndrich
Software Engineer, Google
SLIDE 2 Addictive Mobile Game
[Image: globe, https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg]
SLIDE 3
[Mock-up: game leaderboards with example scores, showing a Team Ranking and an Individual Ranking (Sarah, Joe, Milo) as Hourly Ranking and Daily Ranking]
SLIDE 4 An Unbounded Stream of Game Events
[Timeline: game events arriving hour by hour, 1:00 through 14:00]
SLIDE 5 … with unknown delays.
[Timeline: events from 8:00 arriving late, mixed in with events from later hours]
SLIDE 6 The Resource Allocation Problem
[Charts: workload and provisioned resources over time, showing both over-provisioned and under-provisioned resources]
SLIDE 7 Matching Resources to Workload
[Chart: workload and auto-tuned resources over time]
SLIDE 8 Resources = Parallelism
[Chart: workload and auto-tuned parallelism over time]
More generally: VMs (including CPU, RAM, network, IO).
SLIDE 9
Assumptions:
- Big Data Problem
- Embarrassingly Parallel
- Horizontal Scaling: Scaling VMs ==> Scales Throughput
SLIDE 10 Agenda
1. Streaming Dataflow Pipelines
2. Pipeline Execution
3. Adjusting Parallelism Automatically
4. Summary + Future Work
SLIDE 11 Streaming Dataflow
1
SLIDE 12 Google’s Data-Related Systems
[Timeline, 2002 to 2016: GFS, MapReduce, Big Table, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow]
SLIDE 13 Google Dataflow SDK
Open Source SDK used to construct a Dataflow pipeline. (Now Incubating as Apache Beam)
SLIDE 14 Computing Team Scores
// Collection of raw log lines
PCollection<String> raw = ...;

// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> output = input
    .apply(Window.into(FixedWindows.of(Minutes(60))))
    .apply(Sum.integersPerKey());
SLIDE 15 Google Cloud Dataflow
- Given code written with the Dataflow SDK (incubating as Apache Beam), it can run:
○ On your development machine
○ On the Dataflow Service on Google Cloud Platform
○ On third-party environments like Spark or Flink
SLIDE 16 Cloud Dataflow
A fully-managed cloud service and programming model for batch and streaming big data processing.
Google Cloud Dataflow
SLIDE 17 Google Cloud Dataflow
[Diagram: the pipeline graph is optimized and scheduled for execution, reading input from and writing output to GCS]
SLIDE 18 Back to the Problem at Hand
[Chart: workload and auto-tuned parallelism over time]
SLIDE 19
Auto-Tuning Ingredients
- Signals: measuring Workload
- Policy: making Decisions
- Mechanism: actuating Change
SLIDE 20 Pipeline Execution
2
SLIDE 21 Optimized Pipeline = DAG of Stages
[Diagram: stages S0, S1, S2 transforming raw input into individual points and then team points]
SLIDE 22 Stage Throughput Measure
[Diagram: the same stages S0, S1, S2, with throughput measured at each stage]
SLIDE 23 Picture by Alexandre Duret-Lutz, Creative Commons 2.0 Generic
SLIDE 24 Queues of Data Ready for Processing
Queue Size = Backlog
[Diagram: queues of pending data between stages S0, S1, S2]
SLIDE 25
Backlog Size vs. Backlog Growth
SLIDE 26
Backlog Growth = Processing Deficit
SLIDE 27 Derived Signal: Stage Input Rate
Input Rate = Throughput + Backlog Growth
[Diagram: a stage S1 with its throughput and backlog growth]
SLIDE 28
Constant Backlog... ...could be bad
SLIDE 29
Backlog Time = Backlog Size / Throughput
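As an illustrative example (numbers not from the talk): a backlog of 100 MB at a sustained throughput of 25 MB/s corresponds to a backlog time of 4 seconds.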
SLIDE 30
Backlog Time = Time to get through backlog
SLIDE 31
Bad Backlog = Long Backlog Time
SLIDE 32
Backlog Growth and Backlog Time Inform Upscaling. What Signals indicate Downscaling?
SLIDE 33
Low CPU Utilization
SLIDE 34 Signals Summary
- Throughput
- Backlog growth
- Backlog time
- CPU utilization
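To make the relationship between these signals concrete, here is a minimal Java sketch, assuming throughput, backlog size, backlog growth, and CPU utilization are sampled per stage over some window. The StageSignals class and its field names are invented for illustration; they are not the Dataflow service's internal types.

// A minimal sketch (not Dataflow service code) relating the signals above.
// All names are invented for illustration.
final class StageSignals {
  final double throughput;     // bytes/s the stage is currently processing
  final double backlogGrowth;  // bytes/s by which the stage's input queue grows
  final double backlogSize;    // bytes currently queued for the stage
  final double cpuUtilization; // average worker CPU utilization, 0.0 to 1.0

  StageSignals(double throughput, double backlogGrowth,
               double backlogSize, double cpuUtilization) {
    this.throughput = throughput;
    this.backlogGrowth = backlogGrowth;
    this.backlogSize = backlogSize;
    this.cpuUtilization = cpuUtilization;
  }

  // Input Rate = Throughput + Backlog Growth
  double inputRate() {
    return throughput + backlogGrowth;
  }

  // Backlog Time = Backlog Size / Throughput
  double backlogTimeSeconds() {
    return backlogSize / throughput;
  }
}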
SLIDE 35 Policy: making Decisions
Goals:
- 1. No backlog growth
- 2. Short backlog time
- 3. Reasonable CPU utilization
SLIDE 36
Upscaling Policy: Keeping Up
Given M machines.
For a stage, given:
- average stage throughput T
- average positive backlog growth G of the stage
Machines needed for the stage to keep up:
M' = M × (T + G) / T
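Illustrative example (numbers not from the talk): with M = 10 machines, T = 100 MB/s and G = 20 MB/s, the stage needs M' = 10 × (100 + 20) / 100 = 12 machines to keep up.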
SLIDE 37
Upscaling Policy: Catching Up
Given M machines.
Given R, the time allowed to reduce the backlog.
For a stage, given:
- average backlog time B
Extra machines to remove the backlog:
Extra = M × B / R
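Continuing the illustrative example: with M = 10 machines, an average backlog time of B = 120 s, and R = 60 s, clearing the backlog within a minute takes Extra = 10 × 120 / 60 = 20 additional machines.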
SLIDE 38 Upscaling Policy: All Stages
Want all stages to:
- 1. keep up
- 2. have low backlog time
Pick the maximum over all stages of M' + Extra (see the sketch below).
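A minimal Java sketch combining the keep-up and catch-up rules into the all-stages decision, reusing the hypothetical StageSignals type from the earlier sketch. The class, method, and parameter names are invented; this is not the production Dataflow algorithm.

// Illustrative sketch of the combined upscaling policy, not the production algorithm.
final class UpscalingPolicy {
  // Desired machine count = maximum over all stages of M' + Extra.
  static int desiredMachines(int currentMachines /* M */,
                             double catchUpSeconds /* R */,
                             Iterable<StageSignals> stages) {
    int desired = currentMachines;
    for (StageSignals s : stages) {
      double t = s.throughput;                    // T
      double g = Math.max(0.0, s.backlogGrowth);  // positive backlog growth G
      double b = s.backlogTimeSeconds();          // B

      double keepUp = currentMachines * (t + g) / t;        // M' = M * (T + G) / T
      double extra = currentMachines * b / catchUpSeconds;  // Extra = M * B / R

      desired = Math.max(desired, (int) Math.ceil(keepUp + extra));
    }
    return desired;
  }
}

With the illustrative numbers used above (M = 10, T = 100 MB/s, G = 20 MB/s, B = 120 s, R = 60 s), this returns max(10, ceil(12 + 20)) = 32 machines for that stage.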
SLIDES 39-42 Example (signals)
[Charts: input rate and throughput (MB/s), backlog growth, and backlog time (seconds) over time]
SLIDES 43-46 Example (policy)
[Charts: machines over time, showing M, M', and Extra, with R = 60s]
SLIDE 47
Preconditions for Downscaling
- Low backlog time
- No backlog growth
- Low CPU utilization
SLIDE 48 How far can we downscale?
Stay tuned...
SLIDE 49 Adjusting Parallelism of a Running Streaming Pipeline
Mechanism: actuating Change
3
SLIDES 50-52 Optimized Pipeline = DAG of Stages
[Diagrams: stages S0, S1, S2, shown as a DAG and then placed together on Machine 0]
SLIDE 53 Adding Parallelism
[Diagram: multiple copies of stages S0, S1, S2 on Machine 0]
SLIDE 54 Adding Parallelism
[Diagram: copies of stages S0, S1, S2 spread across Machine 0 and Machine 1]
SLIDE 55 Adding Parallelism = Splitting Key Ranges
[Diagram: the key range of each stage split between Machine 0 and Machine 1]
SLIDE 56
Migrating a Computation
SLIDE 57 Adding Parallelism = Migrating Computation Ranges
[Diagram: a computation range moving from Machine 0 to Machine 1]
SLIDE 58
Checkpoint and Recovery ~ Computation Migration
SLIDE 59 Key Ranges and Persistence
[Diagram: key ranges of stages S0, S1, S2 and their persistent state spread across Machines 0 through 3]
SLIDES 60-63 Downscaling from 4 to 2 Machines
[Diagrams: the key ranges and persistent disks of two machines are migrated to the remaining two machines, step by step]
Upsizing = Steps in Reverse
SLIDE 64 Granularity of Parallelism
As of March 2016, Google Cloud Dataflow:
- Splits key ranges initially based on the maximum number of machines
- At the maximum: 1 logical persistent disk per machine
  (each disk holds a slice of the key ranges from all stages)
- Only (relatively) even disk distributions are allowed
- Results in scaling quanta
SLIDE 65 Example Scaling Quanta: Max = 60 Machines
Parallelism | Disks per Machine
     3      | N/A
     4      | 15
     5      | 12
     6      | 10
     7      | 8, 9
     8      | 7, 8
     9      | 6, 7
    10      | 6
    12      | 5
    15      | 4
    20      | 3
    30      | 2
    60      | 1
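Reading the table: at the maximum of 60 machines there are 60 logical persistent disks, so at parallelism N each machine serves roughly 60/N disks. Where 60 does not divide evenly (for example N = 7), machines serve either 8 or 9 disks, which still counts as a relatively even distribution.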
SLIDE 66 Policy: making Decisions
Goals:
- 1. No backlog growth
- 2. Short backlog time
- 3. Reasonable CPU utilization
SLIDE 67
Preconditions for Downscaling
- Low backlog time
- No backlog growth
- Low CPU utilization
SLIDE 68 Downscaling Policy
Next lower scaling quantum => M' machines
Estimate the future per-machine CPU utilization at M':
CPU_M' = CPU_M × M / M'
If the new CPU_M' is below a threshold (say 90%), downscale to M'.
SLIDE 69 Summary + Future Work
4
SLIDE 70
Artificial Experiment
SLIDE 71
Auto-Scaling Summary
Signals: throughput, backlog time, backlog growth, CPU utilization
Policy: keep up, reduce backlog, use CPUs
Mechanism: split key ranges, migrate computations
SLIDE 72
Future Work
- Experiment with non-uniform disk distributions to address hot ranges
- Dynamically split ranges finer than the initial split
- Approximate model of the relation between #VMs and throughput
SLIDE 73 Questions?
Further reading on the streaming model:
- The world beyond batch: Streaming 101
- The world beyond batch: Streaming 102