PerfDebug: Performance Debugging of Computation Skew in Dataflow Systems
Jason Teoh, Muhammad Ali Gulzar, Harry Xu, Miryung Kim University of California, Los Angeles
Motivating Example
A cron job runs Web Server Anomaly Detection on 20GB of server logs each day.

Cron Day    Server Logs    Execution Time
Day 1       20GB           28 s
Day 2       20GB           25 s
Day 3       20GB           92 s
Why does my job run slowly for day 3’s data?
Uneven distribution of data across partitions, tasks, or workers can lead to performance delays.
Term           Latency
Hello World    2 ms
Big Data       1 ms
Debugging      3 ms
PerfDebug      442 ms

User-defined function:
    commonDefs = { "Hello World": ..., "Big Data": ..., "Debugging": ..., ... }
    if (commonDefs.contains(term)) {
      return commonDefs.get(term)
    } else {
      r = new RedisClient(…)
      return r.get(term)
    }

Uneven distribution of computation due to interactions between data and application code.
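To make the imbalance in the term latency table concrete, the following minimal sketch computes how much of the total UDF time the slowest term accounts for. The latencies are the four values from the slide; the function name `skew_summary` is hypothetical and is not part of PerfDebug.

```python
# Per-term UDF latencies (ms) as listed on the slide.
LATENCY_MS = {
    "Hello World": 2,
    "Big Data": 1,
    "Debugging": 3,
    "PerfDebug": 442,
}


def skew_summary(latency_ms):
    """Return (slowest term, total latency, slowest term's share of the total)."""
    total = sum(latency_ms.values())
    slowest = max(latency_ms, key=latency_ms.get)
    return slowest, total, latency_ms[slowest] / total


slowest, total, share = skew_summary(LATENCY_MS)
print(slowest, total, round(share * 100, 1))  # PerfDebug 448 98.7
```

One record out of four accounts for nearly all of the UDF time, even though every record is the same size.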
Input: Spark program, input data
Computation Skew Detection
Data Provenance + Record-Level Latency
Expensive Record Identification
Output: Individual records responsible for computation skew
Titian [VLDB 2016] provides data provenance using provenance tables at the start/end of stages to track input-output record mappings.

Stage 1: lines, map, reduceByKey (map-side)
Stage 2: reduceByKey (reduce-side), map

Stage 1 start (lines)
Input ID      Output ID
              id1
              id2
              id3
…             …

Stage 1 end (reduceByKey, map-side)
Input ID      Output ID
{id1, id3}    (0, 100)
{id2}         (0, 200)
…             …

Stage 2 start (reduceByKey, reduce-side)
Input ID      Output ID
(0, 100)      100
(1, 100)      100
(0, 200)      200
…             …

Stage 2 end (map)
Input ID      Output ID
100
200
…             …
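The input-output mappings in the provenance tables can be followed backwards from a Stage 2 record to the Stage 1 inputs that produced it. The sketch below is an illustration in plain Python, not Titian's implementation; the dictionary layout and the name `trace_back` are assumptions, and only the rows visible on the slides are included.

```python
# Provenance tables from the slides, stored as output ID -> input IDs.
# Rows elided on the slides ("…") are omitted here as well.
STAGE2_START = {
    100: [(0, 100), (1, 100)],
    200: [(0, 200)],
}
STAGE1_END = {
    (0, 100): ["id1", "id3"],
    (0, 200): ["id2"],
}


def trace_back(output_id):
    """Return the Stage 1 input IDs that contributed to a Stage 2 record."""
    inputs = set()
    for shuffle_id in STAGE2_START.get(output_id, []):
        # Shuffle IDs with no visible Stage 1 row, such as (1, 100), are skipped.
        inputs.update(STAGE1_END.get(shuffle_id, []))
    return sorted(inputs)


print(trace_back(100))  # ['id1', 'id3']
print(trace_back(200))  # ['id2']
```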
PerfDebug extends Titian by capturing summed UDF execution times.

Each provenance table gains a UDF Latency column. When an output record aggregates several inputs, its latency is the sum of the UDF times spent on them: the two inputs behind {id1, id3} take 7 ms and 3 ms, so output (0, 100) is charged 7 + 3 = 10 ms.

Stage 1
Input ID      Output ID    UDF Latency
{id1, id3}    (0, 100)     7 + 3 = 10 ms
{id2}         (0, 200)     20 ms
…             …            …

Stage 2
Input ID      Output ID    UDF Latency
100                        30 ms
200                        40 ms
…             …            …
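As a rough illustration of capturing summed UDF execution times, the sketch below wraps a UDF with a wall-clock timer and accumulates the elapsed time per key. It is a single-process approximation under assumed names (`map_with_latency` is hypothetical) and does not reflect how PerfDebug instruments Spark internally.

```python
import time
from collections import defaultdict


def map_with_latency(records, udf):
    """Apply udf to each (key, value) pair and sum the UDF time per key in ms."""
    outputs = []
    latency_ms = defaultdict(float)
    for key, value in records:
        start = time.perf_counter()
        result = udf(value)
        # Charge the elapsed time to the key this record belongs to.
        latency_ms[key] += (time.perf_counter() - start) * 1000.0
        outputs.append((key, result))
    return outputs, dict(latency_ms)


records = [(100, "a"), (100, "b"), (200, "c")]
outputs, latency_ms = map_with_latency(records, str.upper)
print(outputs)             # [(100, 'A'), (100, 'B'), (200, 'C')]
print(sorted(latency_ms))  # [100, 200]
```

The measured values depend on the machine, so only the keys are shown in the expected output.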
PerfDebug captures data movement costs through partition-level shuffle latencies.

Partition    Shuffle Latency
1            80 ms
2            50 ms
3            100 ms
…            …
PerfDebug calculates per-record stage latency by adding UDF latency and shuffle latency proportional to input count.

Stage 1 (0 ms of shuffle latency is added)
Input ID      Output ID    UDF Latency    Stg Latency
{id1, id3}    (0, 100)     10 ms          10 + 0 ms = 10 ms
{id2}         (0, 200)     20 ms          20 + 0 ms = 20 ms
…             …            …              …

Partition    Shuffle Latency
1            80 ms
2            50 ms
3            100 ms
…            …

Stage 2 (the first record is charged 30 + 3/27 * 80: its UDF latency plus its share of partition 1's 80 ms shuffle latency)
Input ID      Output ID    UDF Latency    Stg Latency
100                        30 ms          40 ms
200                        40 ms          45 ms
…             …            …              …
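The stage-latency rule stated above can be written as a one-line formula. The sketch below is a minimal reading of that rule; the function name and parameter names are hypothetical, and treating a stage with no shuffle as a zero share is an assumption. It evaluates the two expressions shown on the slides, 10 + 0 and 30 + 3/27 * 80. The slides list 40 ms for the second record, which may reflect rounding.

```python
def stage_latency_ms(udf_ms, shuffle_ms=0.0, record_inputs=0, partition_inputs=0):
    """UDF latency plus the record's proportional share of shuffle latency."""
    if partition_inputs == 0:
        # No shuffle inputs for this stage: only the UDF time counts.
        return float(udf_ms)
    return udf_ms + (record_inputs / partition_inputs) * shuffle_ms


print(stage_latency_ms(10))                       # 10.0
print(round(stage_latency_ms(30, 80, 3, 27), 1))  # 38.9
```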
Stage 1
Input ID    Output ID    Stg Latency
input1                   40 ms
input2                   30 ms
input3                   25 ms
input4                   40 ms
input5                   55 ms
input6                   60 ms

Stage 2
Input ID    Output ID    Stg Latency
                         65 ms
                         70 ms
                         40 ms

What about the slowest records in each stage? Computation skew can span multiple stages!
PerfDebug propagates end-to-end latency by adding stage latency to the slowest (max) end-to-end latency of the previous stage.

Stage K
Input ID    Output ID    E2E Latency
input1                   40 ms
input2                   30 ms
input3                   25 ms
input4                   40 ms
input5                   55 ms
input6                   60 ms

Stage K+1
Input ID    Output ID    Stg Latency    E2E Latency
                         65 ms          65 + max(40, 55) = 120 ms
                         70 ms          70 + max(30, 25) = 100 ms
                         40 ms          40 + max(40, 60) = 100 ms
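The propagation rule can be checked against the slide's numbers with a few lines of Python. This is a sketch under assumptions: the function name `propagate_e2e` is hypothetical, and the dependencies of each Stage K+1 record are read off the arguments of the max(...) expressions on the slide.

```python
# End-to-end latency (ms) of each Stage K record, from the slide.
PREV_E2E_MS = {
    "input1": 40, "input2": 30, "input3": 25,
    "input4": 40, "input5": 55, "input6": 60,
}

# One entry per Stage K+1 record:
# (stage latency in ms, Stage K records it depends on).
STAGE_K1 = [
    (65, ["input1", "input5"]),
    (70, ["input2", "input3"]),
    (40, ["input4", "input6"]),
]


def propagate_e2e(stage_records, prev_e2e_ms):
    """Add each record's stage latency to the slowest latency among its inputs."""
    return [
        stage_ms + max(prev_e2e_ms[i] for i in inputs)
        for stage_ms, inputs in stage_records
    ]


print(propagate_e2e(STAGE_K1, PREV_E2E_MS))  # [120, 100, 100]
```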
For each record within each stage, PerfDebug extends end-to-end latency by tracking the program input for the path of max latency.

Stage 1
Input ID    Output ID    E2E Latency    Program input (max-latency path)
input1                   40 ms          input1
input2                   30 ms          input2
input3                   25 ms          input3
input4                   40 ms          input4
input5                   55 ms          input5
input6                   60 ms          input6

Stage 2
Input ID    Output ID    E2E Latency    Program input (max-latency path)
                         120 ms         input5
                         100 ms         input2
                         100 ms         input6
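One way to picture this tracking is to carry a (latency, program input) pair with every record, so that taking the max over a record's inputs also selects the program input on the slowest path. The sketch below is an assumption about representation, not PerfDebug's data structure; `propagate_with_origin` is a hypothetical name, and the record dependencies are inferred from the slide's latencies.

```python
STAGE1_MS = {
    "input1": 40, "input2": 30, "input3": 25,
    "input4": 40, "input5": 55, "input6": 60,
}
# In the first stage, every record is its own program input.
STAGE1 = {name: (ms, name) for name, ms in STAGE1_MS.items()}

# (stage latency in ms, Stage 1 records each Stage 2 record depends on).
STAGE2 = [
    (65, ["input1", "input5"]),
    (70, ["input2", "input3"]),
    (40, ["input4", "input6"]),
]


def propagate_with_origin(stage_records, prev):
    """prev maps record ID -> (E2E latency ms, program input on the slowest path)."""
    out = []
    for stage_ms, inputs in stage_records:
        slowest_ms, origin = max((prev[i] for i in inputs), key=lambda p: p[0])
        out.append((stage_ms + slowest_ms, origin))
    return out


print(propagate_with_origin(STAGE2, STAGE1))
# [(120, 'input5'), (100, 'input2'), (100, 'input6')]
```

Because each output keeps only one program input, the pairs can be fed into the next stage unchanged, which is what lets the latency path span multiple stages.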
15-27 GB data; 10 workers; 1 master; 24GB memory per worker
Input: 27 GB (173M rows)
Pipeline: (Borough, Cost), Shuffle, Average by borough
Task times across Worker1 and Worker2: 220 s, 400 s, 15 s, 20 s
Total runtime: ~7 minutes

Task times show that data skew is a minor performance factor.
PerfDebug detects potential computation skew in the first stage.
PerfDebug identifies the outputs with the highest latency and uses provenance to trace the corresponding inputs. [Figure: Latency Heatmap]
Removal of these records results in ~16X performance improvement.
Benchmark           Accuracy    Precision Improvement    Overhead
Movie Ratings       100%        10^3 X                   1.04X
College Students    100%        10^6 X                   1.39X
Weather Analysis    100%        10^2 X                   1.48X
Average             100%        10^5 X                   1.30X