UNDERSTANDING DISTRIBUTED DATAFLOW SYSTEMS
OUTPUT EXPLANATION AND PERFORMANCE ANALYSIS
John Liagouris liagos@inf.ethz.ch
3 May 2017
Google
PART I: Why is this record in the output of my distributed dataflow?
Concise explanations of …
Desislava Dimitrova, Vasiliki Kalavri, Zaheer Chothia, Moritz Hoffmann, Andrea Lattuada, Timothy Roscoe, Sebastian Wicki, Frank McSherry, Ralf Sager
event logs Enterprise Datacenter Strymon
input stream
iterative analytics worker 1 worker 2
synchronous vs asynchronous shared-nothing vs shared-memory
streaming analytics
▸ A streaming framework for data-parallel computations
▸ Cyclic dataflows
▸ Logical timestamps (epochs)
▸ Asynchronous execution
▸ Low latency
Naiad: A Timely Dataflow System. In SOSP, 2013.
Differential Dataflow. In CIDR, 2013.
COMPUTATION PROVENANCE
INPUT OUTPUT
THIS RECORD LOOKS WRONG!
{A 115 344} {F 233 122} {W 100
{V 30 23} … … …
Output explanation: A subset of the input that is sufficient to reproduce the selected subset of the output
{App 115 344} {VM 233
{App 100 55} {VM 333
… … …
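The definition of an output explanation can be made concrete with a small check. The sketch below is illustrative only: `is_explanation`, `total_per_key`, and the toy records are hypothetical names and data, not part of any system named in this talk. It treats a computation as a function from a set of input records to a set of output records and tests whether a candidate subset of the input reproduces the selected output records.

```python
from typing import Callable, FrozenSet, Hashable, Iterable

Record = Hashable
Computation = Callable[[FrozenSet[Record]], FrozenSet[Record]]


def is_explanation(computation: Computation,
                   full_input: Iterable[Record],
                   candidate: Iterable[Record],
                   selected_output: Iterable[Record]) -> bool:
    """Return True if `candidate` is a subset of the input that reproduces `selected_output`."""
    candidate_set = frozenset(candidate)
    if not candidate_set <= frozenset(full_input):
        return False  # an explanation must be drawn from the original input
    return frozenset(selected_output) <= computation(candidate_set)


def total_per_key(records: FrozenSet[Record]) -> FrozenSet[Record]:
    """Toy computation (hypothetical): sum the values of (key, value) records per key."""
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    return frozenset(totals.items())


full_input = frozenset({("a", 1), ("a", 2), ("b", 5)})
selected = {("a", 3)}

print(is_explanation(total_per_key, full_input, {("a", 1), ("a", 2)}, selected))  # True
print(is_explanation(total_per_key, full_input, {("a", 1)}, selected))            # False
```

The second call fails because replaying the computation on that subset yields ("a", 1) instead of the selected record, so the subset is not sufficient.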
1 2 3 metadata propagation
1’ 2’ 3’
1 2 3 dependencies
1 5 2 3 4
WHY IS (1,3) IN THE OUTPUT?
THE QUICK BROWN FOX … THE LAZY DOG …
WHY ARE ONLY 3 WORDS UNIQUE TO DOCUMENT A?
(doc A, 3 unique words) (doc B, 2 unique words)
Yes! Given that the system is able to:
▸ Keep track of the exact point in the computation at which a data record was produced
▸ Detect divergent records when replaying the computation on a subset of the input
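A minimal sketch of the second capability, under stated assumptions: the computation is modelled as a list of stage functions over sets of records, and a record is called divergent if the replay on a subset produces it at some stage while the original run did not. The helper names (`run_stages`, `divergent_records`) and the two toy stages are hypothetical and are not taken from the system presented here.

```python
from collections import Counter


def run_stages(stages, records):
    """Run the stages in order and keep every intermediate result."""
    results = []
    current = frozenset(records)
    for stage in stages:
        current = frozenset(stage(current))
        results.append(current)
    return results


def divergent_records(stages, full_input, subset):
    """Per stage, the records produced by the replay but absent from the original run."""
    original = run_stages(stages, full_input)
    replay = run_stages(stages, subset)
    return [r - o for o, r in zip(original, replay)]


def words_in_one_doc(records):
    """Toy non-monotonic stage: keep (word, doc) pairs whose word occurs only once."""
    counts = Counter(word for word, _ in records)
    return {(word, doc) for word, doc in records if counts[word] == 1}


def count_per_doc(records):
    """Toy stage: count the surviving words per document."""
    return set(Counter(doc for _, doc in records).items())


stages = [words_in_one_doc, count_per_doc]
full_input = {("the", "A"), ("fox", "A"), ("the", "B"), ("dog", "B")}
subset = {("the", "A"), ("fox", "A")}  # document A only

divergent = divergent_records(stages, full_input, subset)
print(sorted(divergent[0]))  # [('the', 'A')]
print(sorted(divergent[1]))  # [('A', 2)]
```

In this toy run the subset is not a valid explanation: leaving out ("the", "B") lets ("the", "A") pass the first stage, which it did not do in the original run.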
Join Join Join
QUERY EXPL
Op A Op B Op C
INPUT OUTPUT
Trace divergent records backwards
k1 v k2 v’ … … k1 v k2 v’’ … …
THE QUICK BROWN FOX … THE LAZY DOG …
MAP MAP
(THE, A) (BROWN, A) (FOX, A) (THE, B) (LAZY, B) (DOG, B)
GROUP
(THE, [A,B]) (BROWN, A) (FOX, A) (LAZY, B) (DOG, B)
FILTER
(BROWN, A) (FOX, A) (LAZY, B) (DOG, B)
GROUP
(A, 3) (B, 2)
(THE, A)
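The MAP, GROUP, FILTER, GROUP pipeline above can be sketched in a few lines of plain Python. This is only an illustration: the function name `unique_words_per_doc` is hypothetical, and the two documents are assumed to contain exactly the words shown on the slide, with the elided parts ("…") left out.

```python
from collections import defaultdict


def unique_words_per_doc(docs):
    """Count, per document, the words that appear in no other document."""
    # MAP: emit one (word, doc) pair per distinct word of each document
    pairs = {(word, doc) for doc, text in docs.items() for word in text.split()}

    # GROUP: collect the documents in which each word occurs
    docs_by_word = defaultdict(set)
    for word, doc in pairs:
        docs_by_word[word].add(doc)

    # FILTER: keep the words that occur in exactly one document
    unique = [(word, next(iter(ds))) for word, ds in docs_by_word.items() if len(ds) == 1]

    # GROUP: count the surviving words per document
    counts = defaultdict(int)
    for _, doc in unique:
        counts[doc] += 1
    return dict(counts)


# Assumed document contents; the slide elides part of each document.
docs = {"A": "THE QUICK BROWN FOX", "B": "THE LAZY DOG"}
print(sorted(unique_words_per_doc(docs).items()))  # [('A', 3), ('B', 2)]
```

THE is dropped by the FILTER step because it occurs in both documents, which is why only three words remain unique to document A.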
▸ Dataset: A subset of the Twitter graph with 1B edges
▸ Algorithm: Label propagation
▸ Output: Records of the form (A,B) denoting that nodes A and B belong to the same connected component
▸ System used: Differential Dataflow
▸ Machine used: Intel Xeon E5-4640 at 2.4GHz with 32 cores and 500 GB RAM
More results:
in Modern Data Analytics PVLDB 9(12):1137-1148, 2016.
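For readers unfamiliar with label propagation, the following is a minimal single-machine sketch in plain Python. It does not use the Differential Dataflow API. The function name `propagate_labels` is hypothetical, and emitting (node, label) records, where the label is the smallest node id in the component, is one assumed reading of the (A,B) output format.

```python
def propagate_labels(edges):
    """Give every node the smallest node id reachable from it (undirected edges)."""
    labels = {}
    for a, b in edges:
        labels.setdefault(a, a)  # every node starts with its own id as label
        labels.setdefault(b, b)

    changed = True
    while changed:  # repeat until no label changes, i.e. a fixed point
        changed = False
        for a, b in edges:
            smallest = min(labels[a], labels[b])
            if labels[a] != smallest or labels[b] != smallest:
                labels[a] = labels[b] = smallest
                changed = True
    return labels


edges = [(1, 2), (2, 3), (4, 5)]
records = sorted(propagate_labels(edges).items())
print(records)  # [(1, 1), (2, 1), (3, 1), (4, 4), (5, 4)]
```

Each record (A,B) here says that node A belongs to the component whose smallest node is B.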
Apache Flink Naiad
client
W1 W1
The critical path is constructed by starting from the last event and backtracking:
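One way to read this step is sketched below. It is not the construction used in the talk, only an assumed simplification: activities are edges between events, each activity has a kind ("processing", "waiting" or "message", all hypothetical labels), the graph is acyclic, and waiting activities never lie on the critical path. When a worker was waiting before an event, the walk follows the message that ended the wait.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    src: str         # event at which the activity starts
    dst: str         # event at which the activity ends
    kind: str        # "processing", "waiting" or "message" (hypothetical labels)
    duration: float


def critical_path(activities, last_event):
    """Walk backwards from `last_event`, never following a waiting activity."""
    incoming = {}
    for act in activities:
        incoming.setdefault(act.dst, []).append(act)

    path = []
    event = last_event
    while event in incoming:
        candidates = [a for a in incoming[event] if a.kind != "waiting"]
        if not candidates:
            break
        # Prefer local work; fall back to the message that ended a wait.
        act = min(candidates, key=lambda a: a.kind == "message")
        path.append(act)
        event = act.src
    path.reverse()
    return path


activities = [
    Activity("a1", "b1", "processing", 3),  # worker 1 computes
    Activity("b1", "c2", "message", 1),     # worker 1 sends to worker 2
    Activity("a2", "c2", "waiting", 4),     # worker 2 waits for the message
    Activity("c2", "d2", "processing", 2),  # worker 2 computes
]
path = critical_path(activities, "d2")
print([(a.src, a.dst) for a in path])  # [('a1', 'b1'), ('b1', 'c2'), ('c2', 'd2')]
```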
▸ tumbling, sliding or custom windows
[Figure: Input Trace and its Snapshot in [ts,te], for workers w1 and w2; trace events a to i, snapshot events b’ to i’, with edge weights]
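Cutting a trace into snapshots can be sketched as follows. This is an assumption-laden illustration, not the talk's implementation: activities are modelled as hypothetical (name, begin, finish) tuples, windows are tumbling, and an activity that straddles a window boundary is clipped to the part inside [ts,te].

```python
def tumbling_windows(start, end, size):
    """Split [start, end] into consecutive, non-overlapping windows of length `size`."""
    windows = []
    ts = start
    while ts < end:
        windows.append((ts, min(ts + size, end)))
        ts += size
    return windows


def clip_to_snapshot(activities, ts, te):
    """Keep the part of each (name, begin, finish) activity that lies inside [ts, te]."""
    clipped = []
    for name, begin, finish in activities:
        lo, hi = max(begin, ts), min(finish, te)
        if lo < hi:  # drop activities that fall entirely outside the window
            clipped.append((name, lo, hi))
    return clipped


trace = [("x", 0, 3), ("y", 3, 8), ("z", 8, 12)]
print(tumbling_windows(0, 10, 5))      # [(0, 5), (5, 10)]
print(clip_to_snapshot(trace, 2, 7))   # [('x', 2, 3), ('y', 3, 7)]
```

A sliding window would differ only in how the (ts, te) pairs are generated; the clipping step stays the same.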
Snapshot in [ts,te]: Transient Critical Paths
b’ c’ d’ e’ g’ h’
b’ c’ d’ g’ h’ i’
c’ d’ e’ g’ h’ f’
c’ d’ g’ h’ f’ i’
b’ g’ h’ i’
g’ h’ f’ i’
▸ All TCPs are possible parts of the unknown global critical path in the snapshot [ts,te]
TPC(d’,i’) = 2
TPC(g’,h’) = 6
CP(d’,i’) = (2 · 1) / (6 · 5) ≈ 0.067
CP(g’,h’) = (6 · 1) / (6 · 5) = 0.2
In each numerator, the first factor is the number of transient critical paths the edge belongs to and the second factor is the edge weight.
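The arithmetic on this slide can be reproduced with a short sketch. The function and parameter names are hypothetical. Reading the 6 in the denominator as the total number of transient critical paths and the 5 as the snapshot length te - ts is an assumption, not something the slide states.

```python
from collections import Counter


def critical_participation(tcp_count, weight, n_paths, duration):
    """CP of an edge: (paths through the edge * edge weight) / (all paths * snapshot length)."""
    return (tcp_count * weight) / (n_paths * duration)


def count_paths_per_edge(paths):
    """Number of paths each edge belongs to; a path is a list of (src, dst) edges."""
    return Counter(edge for path in paths for edge in set(path))


# Numbers from the slide; the meaning of 6 and 5 is assumed as described above.
print(round(critical_participation(2, 1, 6, 5), 3))  # 0.067
print(round(critical_participation(6, 1, 6, 5), 3))  # 0.2

# Hypothetical paths, only to show how the per-edge counts could be obtained.
toy_paths = [[("x", "y"), ("y", "z")], [("x", "y")]]
print(count_paths_per_edge(toy_paths)[("x", "y")])  # 2
```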
“RDDs” “DataStreams” “Spouts and Bolts”
▸ data transformation
▸ data exchange
▸ control messages
▸ I/O
▸ data (de)-serialization
▸ buffer management
▸ scheduling
“Tensors” Naiad
▸ Benchmark: TPC-DS [1]
▸ System under study: Spark (1.2.1)
▸ Setting: 20 machines with 8 workers each
▸ We actually used Spark logs from [2]
▸ Snapshot interval: 10 sec
[1] TPC-DS. http://www.tpc.org/tpcds/
[2] Ousterhout, K. Spark performance analysis (accessed: April 2017) https://kayousterhout.github.io/trace-analysis/
[Plots: CP (0.0 to 1.0) per snapshot and % weight per snapshot, broken down by Shuffling, Processing, Serialization, Deserialization, ControlMessage, Unknown]
“Optimizing disk usage can improve performance by a median of at most 19%”
Ousterhout, K., Rasti, R., Ratnasamy, S., Shenker, S., and Chun, B.-G. Making sense of performance in data analytics frameworks. In NSDI (2015).
reference application Strymon
Timely
event logs
path analysis
real-time performance summaries
adaptive scheduling dynamic scaling straggler mitigation
feedback dynamic job management
reference application Strymon
Timely
event logs
path analysis
real-time performance summaries
adaptive scheduling dynamic scaling straggler mitigation
feedback real-time job performance management
PART I: Iterative Backward Tracing
PART II: Transient Critical Path Analysis
concise explanations
guarantees
real-time performance summaries
transient critical paths
IN OUT OUT IN
interactive times continuous computations
SNOMED CT GALEN8
▸ Dataset: A subset of the Twitter graph with 300M edges
▸ Algorithm: Stable Matching
▸ Output: Records of the form (A,B) denoting that nodes A and B matched
▸ System used: Differential Dataflow
▸ Machine used: Intel Xeon E5-4640 at 2.4GHz with 32 cores and 500 GB RAM
▸ Benchmark: Yahoo Streaming Benchmark (YSB) [1]
▸ System under study: Flink (1.2.0)
▸ Setting: 1 machine with 8 workers
▸ Snapshot interval: 1 sec
[1] Yahoo Streaming Benchmark. https://github.com/yahoo/streaming-benchmarks
[Plots: CP (0.0 to 1.0) per snapshot. Legend: Single path CP Input Buffer Scheduling Processing BarrierProcessing Serialization Deserialization ControlMessage DataMessage Unknown]
▸ Benchmark: AlexNet program [1] on ImageNet [2]
▸ System under study: TensorFlow (1.0.1)
▸ Setting: 1 machine with 16 workers (CPU threads)
▸ Snapshot interval: 1 sec
[1] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012,
[2] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition
Activities: Unknown ControlMessage DataMessage
Operator Categories: Math Primitives State and Initialization Transformations Machine Learning
[Plot: CP (0.0 to 1.0) per snapshot]
[Plots: CP and % weight per snapshot, broken down by Shuffling, Processing, Serialization, Deserialization, ControlMessage, Unknown]