PerfDebug: Performance Debugging of Computation Skew in Dataflow Systems



slide-1
SLIDE 1

PerfDebug: Performance Debugging of Computation Skew in Dataflow Systems

Jason Teoh, Muhammad Ali Gulzar, Harry Xu, Miryung Kim University of California, Los Angeles

slide-2
SLIDE 2

Motivating Example

Web Server Anomaly Detection
Server Logs

Cron Day 1: 20GB

slide-3
SLIDE 3

Motivating Example

Web Server Anomaly Detection
Server Logs

Cron Day 1: 20GB, Execution Time: 28 s

slide-4
SLIDE 4

Motivating Example

Web Server Anomaly Detection
Server Logs

Cron Day 1: 20GB, Execution Time: 28 s
Cron Day 2: 20GB, Execution Time: 25 s

slide-5
SLIDE 5

Motivating Example

Web Server Anomaly Detection
Server Logs

Cron Day 1: 20GB, Execution Time: 28 s
Cron Day 2: 20GB, Execution Time: 25 s
Cron Day 3: 20GB, Execution Time: 92 s

slide-7
SLIDE 7

Motivating Example

Web Server Anomaly Detection
Server Logs

Cron Day 1: 20GB, Execution Time: 28 s
Cron Day 2: 20GB, Execution Time: 25 s
Cron Day 3: 20GB, Execution Time: 92 s

Why does my job run slowly for day 3's data?

slide-8
SLIDE 8

Data Skew in Distributed Processing

8

Worker1 Worker2 Worker3

Uneven distribution of data across partitions, tasks, or workers can lead to performance delays.

slide-9
SLIDE 9

Computation Skew

Uneven distribution of computation due to interactions between data and application code.

User-defined function:

    commonDefs = { "Hello World": ..., "Big Data": ..., "Debugging": ..., ... }
    if (commonDefs.contains(term)) {
      return commonDefs.get(term)
    } else {
      r = new RedisClient(…)
      return r.get(term)
    }

Term           Latency
Hello World    2 ms
Big Data       1 ms
Debugging      3 ms
PerfDebug      442 ms
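The effect described on this slide can be reproduced with a small Python sketch. Everything here is illustrative: the cached definitions are placeholder strings, and `slow_remote_lookup` is a hypothetical stand-in that sleeps instead of contacting a real key-value store. The point is only that one record taking the slow path dominates per-record latency.

```python
import time

# Hypothetical local cache of common definitions (placeholder values).
COMMON_DEFS = {
    "Hello World": "a greeting program",
    "Big Data": "large-scale data sets",
    "Debugging": "finding and fixing defects",
}


def slow_remote_lookup(term, delay_s=0.05):
    """Hypothetical stand-in for a remote lookup; the sleep simulates its cost."""
    time.sleep(delay_s)
    return f"definition of {term}"


def define(term):
    """UDF: fast path for cached terms, slow path for everything else."""
    if term in COMMON_DEFS:
        return COMMON_DEFS[term]
    return slow_remote_lookup(term)


def per_record_latency_ms(terms):
    """Time the UDF separately for each input record."""
    latencies = {}
    for term in terms:
        start = time.perf_counter()
        define(term)
        latencies[term] = (time.perf_counter() - start) * 1000.0
    return latencies


latencies = per_record_latency_ms(["Hello World", "Big Data", "Debugging", "PerfDebug"])
slowest = max(latencies, key=latencies.get)
print(slowest, round(latencies[slowest], 1), "ms")
```

All four records are the same size, so partition sizes alone would not reveal the problem; only per-record timing does.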

slide-10
SLIDE 10

Computation Skew

Why is it challenging?

  • Requires insight on how application code interacts with data.
  • Occurs across multiple stages.
  • Affected applications are inherently expensive to run.
  • Isolating individual records that impact performance is difficult with existing tools.

slide-11
SLIDE 11

Performance Debugging of Computation Skew

Input: Spark program, input data

PerfDebug:
  • Computation Skew Detection
  • Data Provenance + Record-Level Latency
  • Expensive Record Identification

Output: Individual records responsible for computation skew

slide-12
SLIDE 12

PerfDebug Approach

Input: Spark program, input data

PerfDebug:
  • Computation Skew Detection
  • Data Provenance + Record-Level Latency
  • Expensive Record Identification

Output: Individual records responsible for computation skew

slide-13
SLIDE 13

Computation Skew Detection

  • PerfDebug monitors task-level metrics such as latency, garbage collection, and serialization using the SparkListener API.
  • If potential computation skew is found, rerun the user program in debugging mode to collect additional information.

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification
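A minimal Python sketch of what a task-level skew check could look like. The slides do not state the detection rule, so the median-based threshold, the `factor` parameter, and the metric names are all assumptions made for illustration; the task metrics are invented sample values, not measurements.

```python
from statistics import median


def find_skewed_tasks(task_metrics, metric="latency_ms", factor=3.0):
    """Return IDs of tasks whose metric exceeds `factor` times the median.

    The threshold rule is an illustrative assumption, not PerfDebug's documented rule.
    """
    baseline = median(m[metric] for m in task_metrics.values())
    return sorted(
        task_id
        for task_id, m in task_metrics.items()
        if baseline > 0 and m[metric] > factor * baseline
    )


# Hypothetical task-level metrics, shaped like what a listener might report.
tasks = {
    "task-0": {"latency_ms": 900, "gc_ms": 40, "serialization_ms": 12},
    "task-1": {"latency_ms": 950, "gc_ms": 35, "serialization_ms": 10},
    "task-2": {"latency_ms": 1000, "gc_ms": 45, "serialization_ms": 11},
    "task-3": {"latency_ms": 8200, "gc_ms": 900, "serialization_ms": 15},
}

skewed = find_skewed_tasks(tasks)
needs_debug_rerun = bool(skewed)  # if True, rerun the program in debugging mode
print(skewed, needs_debug_rerun)
```

A check like this only says that some task is an outlier; it cannot say which records caused it, which is why the later steps add provenance and record-level latency.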

slide-15
SLIDE 15

Capture Data Provenance

Titian [VLDB 2016] provides data provenance using provenance tables at the start/end of stages to track input-output record mappings.

Stage 1: lines -> map -> reduceByKey (map-side)
Stage 2: reduceByKey (reduce-side) -> map

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-16
SLIDE 16

Capture Data Provenance

Titian [VLDB 2016] provides data provenance using provenance tables at the start/end of stages to track input-output record mappings.

Stage 1: lines -> map -> reduceByKey (map-side)
Stage 2: reduceByKey (reduce-side) -> map

Input ID      Output ID
offset1       id1
offset2       id2
offset3       id3
…             …

Input ID      Output ID
{id1, id3}    (0, 100)
{id2}         (0, 200)
…             …

Input ID      Output ID
(0, 100)      100
(1, 100)      100
(0, 200)      200
…             …

Input ID      Output ID
100           output1
200           output2
…             …

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification
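To show how input-output mappings like these support backward tracing, here is a small Python sketch. It is not Titian's API: the dictionaries simply restate the rows visible on the slide (elided "…" rows are left out), the table names are hypothetical, and the order of the tables is inferred from matching IDs.

```python
# Each table maps an output ID to the input IDs that produced it.
# Rows are copied from the slide; elided rows ("…") are omitted.
TABLE_OUTPUTS = {"output1": [100], "output2": [200]}
TABLE_KEYS = {100: [(0, 100), (1, 100)], 200: [(0, 200)]}
TABLE_GROUPS = {(0, 100): ["id1", "id3"], (0, 200): ["id2"]}
TABLE_OFFSETS = {"id1": ["offset1"], "id2": ["offset2"], "id3": ["offset3"]}

# Ordered from the program output back to the program input.
BACKWARD_TABLES = [TABLE_OUTPUTS, TABLE_KEYS, TABLE_GROUPS, TABLE_OFFSETS]


def trace_back(output_id, tables=BACKWARD_TABLES):
    """Follow the mappings backward and return the contributing input offsets."""
    frontier = {output_id}
    for table in tables:
        next_frontier = set()
        for record_id in frontier:
            # IDs with no visible row (for example (1, 100)) contribute nothing here.
            next_frontier.update(table.get(record_id, []))
        frontier = next_frontier
    return frontier


print(sorted(trace_back("output1")))
print(sorted(trace_back("output2")))
```

Tracing of this kind returns every contributing input with equal weight, which is the limitation the latency instrumentation in the following slides addresses.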

slide-21
SLIDE 21

Measure UDF Latency

PerfDebug extends Titian by capturing summed UDF execution times.

Stage 1: lines -> map -> reduceByKey (map-side)
Stage 2: reduceByKey (reduce-side) -> map

Input ID      Output ID
offset1       id1
offset2       id2
offset3       id3
…             …

Input ID      Output ID
{id1, id3}    (0, 100)
{id2}         (0, 200)
…             …

Input ID      Output ID
(0, 100)      100
(1, 100)      100
(0, 200)      200
…             …

Input ID      Output ID
100           output1
200           output2
…             …

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-22
SLIDE 22

Measure UDF Latency

PerfDebug extends Titian by capturing summed UDF execution times.

Stage 1: lines -> map -> reduceByKey (map-side)
Stage 2: reduceByKey (reduce-side) -> map

Input ID      Output ID
offset1       id1
offset2       id2
offset3       id3
…             …

UDF execution times: 7 ms, 3 ms

Input ID      Output ID
{id1, id3}    (0, 100)
{id2}         (0, 200)
…             …

Input ID      Output ID
(0, 100)      100
(1, 100)      100
(0, 200)      200
…             …

Input ID      Output ID
100           output1
200           output2
…             …

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-23
SLIDE 23

Measure UDF Latency

PerfDebug extends Titian by capturing summed UDF execution times.

Stage 1: lines -> map -> reduceByKey (map-side)
Stage 2: reduceByKey (reduce-side) -> map

Input ID      Output ID
offset1       id1
offset2       id2
offset3       id3
…             …

UDF execution times: 7 ms, 3 ms

Input ID      Output ID    UDF Latency
{id1, id3}    (0, 100)     7 + 3 = 10 ms
{id2}         (0, 200)
…             …

Input ID      Output ID
(0, 100)      100
(1, 100)      100
(0, 200)      200
…             …

Input ID      Output ID
100           output1
200           output2
…             …

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-24
SLIDE 24

Measure UDF Latency

PerfDebug extends Titian by capturing summed UDF execution times.

Stage 1: lines -> map -> reduceByKey (map-side)
Stage 2: reduceByKey (reduce-side) -> map

Input ID      Output ID
offset1       id1
offset2       id2
offset3       id3
…             …

Input ID      Output ID    UDF Latency
{id1, id3}    (0, 100)     10 ms
{id2}         (0, 200)     20 ms
…             …            …

Input ID      Output ID
(0, 100)      100
(1, 100)      100
(0, 200)      200
…             …

Input ID      Output ID    UDF Latency
100           output1      30 ms
200           output2      40 ms
…             …            …

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification
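The "summed UDF execution times" idea can be sketched in a few lines of Python. This is a reading of the slide rather than PerfDebug's code: it assumes the 7 ms and 3 ms shown belong to id1 and id3, and that id2 alone accounts for the 20 ms in the second row. The function name is hypothetical.

```python
def summed_udf_latency(groups, input_latency_ms):
    """For each output ID, add up the UDF time spent on the inputs grouped into it."""
    return {
        output_id: sum(input_latency_ms[i] for i in input_ids)
        for output_id, input_ids in groups.items()
    }


# Input-to-output grouping from the provenance table on the slide.
groups = {(0, 100): ["id1", "id3"], (0, 200): ["id2"]}

# Per-input UDF times; the assignment to individual IDs is an assumption.
input_latency_ms = {"id1": 7, "id3": 3, "id2": 20}

udf_latency = summed_udf_latency(groups, input_latency_ms)
print(udf_latency)
```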

slide-25
SLIDE 25

Measure Shuffle Latency

PerfDebug captures data movement costs through partition-level shuffle latencies.

Stage 1: lines -> map -> reduceByKey (map-side)
Stage 2: reduceByKey (reduce-side) -> map

Input ID      Output ID
offset1       id1
offset2       id2
offset3       id3
…             …

Input ID      Output ID    UDF Latency
{id1, id3}    (0, 100)     10 ms
{id2}         (0, 200)     20 ms
…             …            …

Input ID      Output ID
(0, 100)      100
(1, 100)      100
(0, 200)      200
…             …

Input ID      Output ID    UDF Latency
100           output1      30 ms
200           output2      40 ms
…             …            …

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-26
SLIDE 26

Measure Shuffle Latency

PerfDebug captures data movement costs through partition-level shuffle latencies.

Stage 1: lines -> map -> reduceByKey (map-side)
Stage 2: reduceByKey (reduce-side) -> map

Input ID      Output ID
offset1       id1
offset2       id2
offset3       id3
…             …

Input ID      Output ID    UDF Latency
{id1, id3}    (0, 100)     10 ms
{id2}         (0, 200)     20 ms
…             …            …

Partition     Shuffle Latency
1             80 ms
2             50 ms
3             100 ms
…             …

Input ID      Output ID
(0, 100)      100
(1, 100)      100
(0, 200)      200
…             …

Input ID      Output ID    UDF Latency
100           output1      30 ms
200           output2      40 ms
…             …            …

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-27
SLIDE 27

Calculate Stage Latency

PerfDebug calculates per-record stage latency by adding UDF latency and shuffle latency proportional to input count.

Stage 1: lines -> map -> reduceByKey (map-side)
Stage 2: reduceByKey (reduce-side) -> map

Input ID      Output ID
offset1       id1
offset2       id2
offset3       id3
…             …

Input ID      Output ID    UDF Latency
{id1, id3}    (0, 100)     10 ms
{id2}         (0, 200)     20 ms
…             …            …

Partition     Shuffle Latency
1             80 ms
2             50 ms
3             100 ms
…             …

Input ID      Output ID
(0, 100)      100
(1, 100)      100
(0, 200)      200
…             …

Input ID      Output ID    UDF Latency
100           output1      30 ms
200           output2      40 ms
…             …            …

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-28
SLIDE 28

Calculate Stage Latency

PerfDebug calculates per-record stage latency by adding UDF latency and shuffle latency proportional to input count.

Stage 1: lines -> map -> reduceByKey (map-side)
Stage 2: reduceByKey (reduce-side) -> map

Input ID      Output ID
offset1       id1
offset2       id2
offset3       id3
…             …

Input ID      Output ID    UDF Latency    Stg Latency
{id1, id3}    (0, 100)     10 ms          10 + 0 ms
{id2}         (0, 200)     20 ms          20 + 0 ms
…             …            …              …

Partition     Shuffle Latency
1             80 ms
2             50 ms
3             100 ms
…             …

Input ID      Output ID
(0, 100)      100
(1, 100)      100
(0, 200)      200
…             …

Input ID      Output ID    UDF Latency
100           output1      30 ms
200           output2      40 ms
…             …            …

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-29
SLIDE 29

Calculate Stage Latency

PerfDebug calculates per-record stage latency by adding UDF latency and shuffle latency proportional to input count.

Stage 1: lines -> map -> reduceByKey (map-side)
Stage 2: reduceByKey (reduce-side) -> map

Input ID      Output ID
offset1       id1
offset2       id2
offset3       id3
…             …

Input ID      Output ID    UDF Latency    Stg Latency
{id1, id3}    (0, 100)     10 ms          10 ms
{id2}         (0, 200)     20 ms          20 ms
…             …            …              …

Partition     Shuffle Latency
1             80 ms
2             50 ms
3             100 ms
…             …

Input ID      Output ID
(0, 100)      100
(1, 100)      100
(0, 200)      200
…             …

Input ID      Output ID    UDF Latency    Stg Latency
100           output1      30 ms
200           output2      40 ms
…             …            …              …

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-30
SLIDE 30

Calculate Stage Latency

PerfDebug calculates per-record stage latency by adding UDF latency and shuffle latency proportional to input count.

Stage 1: lines -> map -> reduceByKey (map-side)
Stage 2: reduceByKey (reduce-side) -> map

Input ID      Output ID
offset1       id1
offset2       id2
offset3       id3
…             …

Input ID      Output ID    UDF Latency    Stg Latency
{id1, id3}    (0, 100)     10 ms          10 ms
{id2}         (0, 200)     20 ms          20 ms
…             …            …              …

Partition     Shuffle Latency
1             80 ms
2             50 ms
3             100 ms
…             …

Input ID      Output ID
(0, 100)      100
(1, 100)      100
(0, 200)      200
…             …

Input ID      Output ID    UDF Latency    Stg Latency
100           output1      30 ms          30 + 3/27 * 80
200           output2      40 ms
…             …            …

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-31
SLIDE 31

Calculate Stage Latency

PerfDebug calculates per-record stage latency by adding UDF latency and shuffle latency proportional to input count.

Stage 1: lines -> map -> reduceByKey (map-side)
Stage 2: reduceByKey (reduce-side) -> map

Input ID      Output ID
offset1       id1
offset2       id2
offset3       id3
…             …

Input ID      Output ID    UDF Latency    Stg Latency
{id1, id3}    (0, 100)     10 ms          10 ms
{id2}         (0, 200)     20 ms          20 ms
…             …            …              …

Partition     Shuffle Latency
1             80 ms
2             50 ms
3             100 ms
…             …

Input ID      Output ID
(0, 100)      100
(1, 100)      100
(0, 200)      200
…             …

Input ID      Output ID    UDF Latency    Stg Latency
100           output1      30 ms          40 ms
200           output2      40 ms          45 ms
…             …            …              …

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification
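One plausible reading of "shuffle latency proportional to input count" is sketched below in Python. The weighting (a record's input count divided by the partition's total input count) is an assumption; the slides show only one worked expression, which reads as 30 plus 3/27 of the 80 ms partition shuffle latency. The function and parameter names are hypothetical.

```python
def stage_latency_ms(udf_ms, shuffle_ms, record_inputs, partition_inputs):
    """UDF latency plus this record's share of the partition's shuffle latency.

    Assumed share: record_inputs / partition_inputs.
    """
    if partition_inputs <= 0:
        raise ValueError("partition_inputs must be positive")
    return udf_ms + shuffle_ms * record_inputs / partition_inputs


# A record with no shuffle cost keeps its UDF latency (the "10 + 0 ms" row).
no_shuffle = stage_latency_ms(udf_ms=10, shuffle_ms=0, record_inputs=2, partition_inputs=2)

# The worked expression as read from the slide: 30 + 3/27 * 80.
with_shuffle = stage_latency_ms(udf_ms=30, shuffle_ms=80, record_inputs=3, partition_inputs=27)

print(no_shuffle, round(with_shuffle, 2))
```

Under this reading the second value evaluates to a little under 39 ms, so the whole-number stage latencies shown in the slide tables should be treated as illustrative figures.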

slide-33
SLIDE 33

Expensive Record Identification

  • Stage latency is confined to a single stage and is insufficient for debugging on its own.
  • Code and data interact across multiple stages.

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-34
SLIDE 34
Expensive Record Identification

  • Stage Latency is within a given stage and insufficient for debugging.
  • Code and data interact across multiple stages.

Stage 1
InputID   Output ID   Stg Latency
input1    o1          40 ms
input2    o2          30 ms
input3    o2          25 ms
input4    o3          40 ms
input5    o1          55 ms
input6    o3          60 ms

Stage 2
InputID   Output ID   Stg Latency
o1        output1     65 ms
o2        output2     70 ms
o3        output3     40 ms

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-35
SLIDE 35
Expensive Record Identification

  • Stage Latency is within a given stage and insufficient for debugging.
  • Code and data interact across multiple stages.

Stage 1
InputID   Output ID   Stg Latency
input1    o1          40 ms
input2    o2          30 ms
input3    o2          25 ms
input4    o3          40 ms
input5    o1          55 ms
input6    o3          60 ms

Stage 2
InputID   Output ID   Stg Latency
o1        output1     65 ms
o2        output2     70 ms
o3        output3     40 ms

What about the slowest records in each stage?

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-36
SLIDE 36
Expensive Record Identification

  • Stage Latency is within a given stage and insufficient for debugging.
  • Code and data interact across multiple stages.

Stage 1
InputID   Output ID   Stg Latency
input1    o1          40 ms
input2    o2          30 ms
input3    o2          25 ms
input4    o3          40 ms
input5    o1          55 ms
input6    o3          60 ms

Stage 2
InputID   Output ID   Stg Latency
o1        output1     65 ms
o2        output2     70 ms
o3        output3     40 ms

Computation skew can span multiple stages!

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-37
SLIDE 37

Propagate End-to-End Latency

PerfDebug propagates end-to-end latency by adding stage latency to the slowest (max) end-to-end latency of the previous stage.

Stage K
InputID   Output ID   E2E Latency
input1    o1          40 ms
input2    o2          30 ms
input3    o2          25 ms
input4    o3          40 ms
input5    o1          55 ms
input6    o3          60 ms

Stage K+1
InputID   Output ID   Stg Latency
o1        output1     65 ms
o2        output2     70 ms
o3        output3     40 ms

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-38
SLIDE 38

Propagate End-to-End Latency

PerfDebug propagates end-to-end latency by adding stage latency to the slowest (max) end-to-end latency of the previous stage.

Stage K
InputID   Output ID   E2E Latency
input1    o1          40 ms
input2    o2          30 ms
input3    o2          25 ms
input4    o3          40 ms
input5    o1          55 ms
input6    o3          60 ms

Stage K+1
InputID   Output ID   Stg Latency   E2E Latency
o1        output1     65 ms         65 + max(40,55)
o2        output2     70 ms
o3        output3     40 ms

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-39
SLIDE 39

Propagate End-to-End Latency

PerfDebug propagates end-to-end latency by adding stage latency to the slowest (max) end-to-end latency of the previous stage.

Stage K
InputID   Output ID   E2E Latency
input1    o1          40 ms
input2    o2          30 ms
input3    o2          25 ms
input4    o3          40 ms
input5    o1          55 ms
input6    o3          60 ms

Stage K+1
InputID   Output ID   Stg Latency   E2E Latency
o1        output1     65 ms         65 + max(40,55)
o2        output2     70 ms         70 + max(30,25)
o3        output3     40 ms         40 + max(40,60)

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-40
SLIDE 40

Propagate End-to-End Latency

PerfDebug propagates end-to-end latency by adding stage latency to the slowest (max) end-to-end latency of the previous stage.

Stage K
InputID   Output ID   E2E Latency
input1    o1          40 ms
input2    o2          30 ms
input3    o2          25 ms
input4    o3          40 ms
input5    o1          55 ms
input6    o3          60 ms

Stage K+1
InputID   Output ID   E2E Latency
o1        output1     120 ms
o2        output2     100 ms
o3        output3     100 ms

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification
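The propagation rule stated on this slide (stage latency plus the max end-to-end latency among a record's inputs) can be sketched in Python as follows. The row tuples restate the slide's example, with the intermediate IDs written as o1, o2, o3; the function name and data layout are illustrative choices, not PerfDebug's implementation.

```python
# Stage K rows: (input_id, output_id, end_to_end_latency_ms)
stage_k = [
    ("input1", "o1", 40), ("input2", "o2", 30), ("input3", "o2", 25),
    ("input4", "o3", 40), ("input5", "o1", 55), ("input6", "o3", 60),
]

# Stage K+1 rows: (input_id, output_id, stage_latency_ms)
stage_k1 = [("o1", "output1", 65), ("o2", "output2", 70), ("o3", "output3", 40)]


def propagate_e2e(prev_rows, next_rows):
    """Add each row's stage latency to the slowest E2E latency feeding it."""
    slowest = {}
    for _, out_id, e2e_ms in prev_rows:
        slowest[out_id] = max(slowest.get(out_id, 0), e2e_ms)
    return [
        (in_id, out_id, stg_ms + slowest.get(in_id, 0))
        for in_id, out_id, stg_ms in next_rows
    ]


e2e_k1 = propagate_e2e(stage_k, stage_k1)
for row in e2e_k1:
    print(row)
```

Taking the max rather than the sum models a record as ready only once its slowest input has arrived, which matches the 65 + max(40,55) step shown in the slides.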

slide-42
SLIDE 42

Propagate Expensive Inputs

  • Not all inputs contribute equally to application performance.
  • Data provenance alone cannot differentiate between these inputs if multiple map to the same record.

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-43
SLIDE 43

Propagate Expensive Inputs

For each record within each stage, PerfDebug extends end-to-end latency by tracking the program input for the path of max latency.

Stage 1
InputID   Output ID   E2E Latency
input1    o1          40 ms
input2    o2          30 ms
input3    o2          25 ms
input4    o3          40 ms
input5    o1          55 ms
input6    o3          60 ms

Stage 2
InputID   Output ID   E2E Latency
o1        output1     120 ms
o2        output2     100 ms
o3        output3     100 ms

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-44
SLIDE 44

Propagate Expensive Inputs

For each record within each stage, PerfDebug extends end-to-end latency by tracking the program input for the path of max latency.

Stage 1
InputID   Output ID   E2E Latency   Exp. Input
input1    o1          40 ms         input1
input2    o2          30 ms         input2
input3    o2          25 ms         input3
input4    o3          40 ms         input4
input5    o1          55 ms         input5
input6    o3          60 ms         input6

Stage 2
InputID   Output ID   E2E Latency
o1        output1     120 ms
o2        output2     100 ms
o3        output3     100 ms

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-45
SLIDE 45

Propagate Expensive Inputs

For each record within each stage, PerfDebug extends end-to-end latency by tracking the program input for the path of max latency.

Stage 1
InputID   Output ID   E2E Latency   Exp. Input
input1    o1          40 ms         input1
input2    o2          30 ms         input2
input3    o2          25 ms         input3
input4    o3          40 ms         input4
input5    o1          55 ms         input5
input6    o3          60 ms         input6

Stage 2
InputID   Output ID   E2E Latency   Exp. Input
o1        output1     120 ms        input5
o2        output2     100 ms
o3        output3     100 ms

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification

slide-46
SLIDE 46

Propagate Expensive Inputs

For each record within each stage, PerfDebug extends end-to-end latency by tracking the program input for the path of max latency.

Stage 1
InputID   Output ID   E2E Latency   Exp. Input
input1    o1          40 ms         input1
input2    o2          30 ms         input2
input3    o2          25 ms         input3
input4    o3          40 ms         input4
input5    o1          55 ms         input5
input6    o3          60 ms         input6

Stage 2
InputID   Output ID   E2E Latency   Exp. Input
o1        output1     120 ms        input5
o2        output2     100 ms        input2
o3        output3     100 ms        input6

Computation Skew Detection Data Provenance + Record-Level Latency Expensive Record Identification
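Tracking the program input along the path of max latency can be sketched by carrying one extra field through the same propagation step. The Python below is an illustration under stated assumptions: rows restate the slide's example (intermediate IDs written as o1, o2, o3), the Stage 2 stage latencies of 65, 70, and 40 ms come from the earlier slides, and the function name is hypothetical.

```python
# Stage 1 rows: (input_id, output_id, e2e_latency_ms, expensive_input).
# At the first stage, each record's most expensive input is itself.
stage1 = [
    ("input1", "o1", 40, "input1"), ("input2", "o2", 30, "input2"),
    ("input3", "o2", 25, "input3"), ("input4", "o3", 40, "input4"),
    ("input5", "o1", 55, "input5"), ("input6", "o3", 60, "input6"),
]

# Stage 2 rows: (input_id, output_id, stage_latency_ms)
stage2 = [("o1", "output1", 65), ("o2", "output2", 70), ("o3", "output3", 40)]


def propagate_expensive_inputs(prev_rows, next_rows):
    """Carry forward the max E2E latency and the program input that caused it."""
    best = {}  # intermediate ID -> (e2e_latency_ms, expensive_input)
    for _, out_id, e2e_ms, exp_input in prev_rows:
        if out_id not in best or e2e_ms > best[out_id][0]:
            best[out_id] = (e2e_ms, exp_input)
    result = []
    for in_id, out_id, stg_ms in next_rows:
        prev_e2e, exp_input = best[in_id]
        result.append((in_id, out_id, stg_ms + prev_e2e, exp_input))
    return result


stage2_result = propagate_expensive_inputs(stage1, stage2)
for row in stage2_result:
    print(row)
```

Each output now names a single program input instead of the full set that provenance alone would return.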

slide-47
SLIDE 47

PerfDebug Approach Recap

  • Monitoring to detect presence of computation skew
  • Instrumented execution to collect data provenance and latency
  • Propagation algorithm to analyze end-to-end record impact and identify records responsible for computation skew

Input: Spark program, input data

PerfDebug:
  • Computation Skew Detection
  • Data Provenance + Record-Level Latency
  • Expensive Record Identification

Output: Individual records responsible for computation skew

slide-48
SLIDE 48

Evaluation

RQ1: What is the impact of applying appropriate remediations?
RQ2: How much overhead does PerfDebug introduce?
RQ3: How accurate is PerfDebug at identifying delay-inducing inputs?

slide-49
SLIDE 49

RQ1: Remediation Impact

  • Three case studies with varying computation skew causes: data skew, data quality, and expensive UDF.
  • 1.5X to 16X performance improvement with case-specific fixes.

Setup: 15-27 GB data; 10 workers, 1 master; 24GB memory per worker

slide-50
SLIDE 50

NYC Taxi Trips Case Study

Goal: compute average cost of a taxi ride for each starting borough

27 GB (173M rows)
Worker1, Worker2: (Borough,Cost) -> Shuffle -> Average by borough
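The shape of this job can be sketched in plain Python, without Spark. The `get_borough` function below is a toy stand-in (a single longitude cutoff chosen only for illustration, not a real borough lookup), and the three trips are made-up sample rows; the sketch only mirrors the pipeline of mapping each trip to (Borough, Cost) and then averaging by borough.

```python
from collections import defaultdict


def get_borough(lat, lon):
    """Toy stand-in for a borough lookup; a real lookup may be far more expensive."""
    return "Manhattan" if lon < -73.93 else "Queens"


def average_cost_by_borough(trips):
    """Map each trip to (borough, cost), then average the cost per borough."""
    totals = defaultdict(lambda: [0.0, 0])  # borough -> [cost sum, trip count]
    for lat, lon, cost in trips:
        borough = get_borough(lat, lon)  # map step: (Borough, Cost)
        totals[borough][0] += cost       # grouping stands in for the shuffle
        totals[borough][1] += 1
    return {b: total / count for b, (total, count) in totals.items()}


# Hypothetical sample rows: (start latitude, start longitude, cost).
trips = [(40.75, -73.99, 10.0), (40.76, -73.98, 20.0), (40.73, -73.80, 30.0)]
averages = average_cost_by_borough(trips)
print(averages)
```

In a job of this shape, the per-record cost of the borough lookup runs before the shuffle, so any expensive record shows up in the first stage.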

slide-51
SLIDE 51

NYC Taxi Trips Case Study

27 GB (173M rows)
Worker1, Worker2: (Borough,Cost) -> Shuffle -> Average by borough

Task times: 220 s, 400 s, 15 s, 20 s

Total runtime: ~7 minutes

slide-52
SLIDE 52

NYC Taxi Trips Case Study

27 GB (173M rows)
Worker1, Worker2: (Borough,Cost) -> Shuffle -> Average by borough

Task times: 220 s, 400 s, 15 s, 20 s

Task times show that data skew is a minor performance factor.

slide-53
SLIDE 53

NYC Taxi Trips Case Study

27 GB (173M rows)
Worker1, Worker2: (Borough,Cost) -> Shuffle -> Average by borough

Task times: 220 s, 400 s, 15 s, 20 s

PerfDebug detects potential computation skew in the first stage.

slide-54
SLIDE 54

NYC Taxi Trips Case Study

27 GB (173M rows)
Worker1, Worker2: (Borough,Cost) -> Shuffle -> Average by borough

PerfDebug identifies the outputs with the highest latency and uses provenance to trace the corresponding inputs.

Latency Heatmap

slide-57
SLIDE 57

NYC Taxi Trips Case Study Results

  • PerfDebug isolates the source of computation skew to a small subset of inputs: 0.0006%
  • Inspection reveals that a getBorough UDF consumes the majority of task time.

Removal of these records results in ~16X performance improvement.

slide-58
SLIDE 58

RQ2: Instrumentation Overhead

  • Three benchmarks, ten trials each.
  • Titian adds ~30% runtime overhead versus Spark [VLDB 2016].
  • PerfDebug adds ~30% runtime overhead compared to Titian.
  • Majority of additional overhead due to using persistent storage for post-mortem debugging, which was not required in Titian.

slide-59
SLIDE 59

RQ3: Precision and Recall

  • Three benchmarks, ten trials each.
  • Use mutation testing to randomly inject an input record with delays.
  • PerfDebug consistently identified target: 100% precision and recall.
  • 2-6 orders of magnitude better precision compared to provenance-only input tracing of outputs using Titian.

Benchmark          Accuracy   Precision Improvement   Overhead
Movie Ratings      100%       10^3X                   1.04X
College Students   100%       10^6X                   1.39X
Weather Analysis   100%       10^2X                   1.48X
Average            100%       10^5X                   1.30X

slide-60
SLIDE 60

Conclusion

  • PerfDebug is a post-mortem performance debugging tool that combines data provenance and record-level latency instrumentation to precisely pinpoint records that cause computation skew.
  • Case-specific fixes can yield up to 16X performance improvement.

slide-61
SLIDE 61

Related Work

  • Ernest [NSDI 2016], ARIA [ICAC 2011], Jockey [EuroSys 2012], Starfish [CIDR 2011]: performance modeling for prediction, but not debugging of computation skew.
  • PerfXplain [VLDB 2012]: job and task comparison for debugging and explanation with respect to collected metrics.
  • Titian [VLDB 2016]: data provenance within Apache Spark, used as foundation for PerfDebug implementation.
  • Additional works mentioned in paper.