understanding distributed dataflow

UNDERSTANDING DISTRIBUTED DATAFLOW John Liagouris - PowerPoint PPT Presentation



  1. UNDERSTANDING DISTRIBUTED DATAFLOW SYSTEMS: OUTPUT EXPLANATION AND PERFORMANCE ANALYSIS. John Liagouris, liagos@inf.ethz.ch. Google, 3 May 2017

  2. PART I: Why is this record in the output of my distributed dataflow? ▸ Concise explanations of individual outputs ▸ On-demand output reproduction. PART II: Why is my distributed dataflow slow? ▸ Bottleneck detection ▸ Critical path analysis

  3. COLLABORATORS: Vasiliki Kalavri, Ralf Sager, Andrea Lattuada, Desislava Dimitrova, Zaheer Chothia, Sebastian Wicki, Frank McSherry, Moritz Hoffmann, Timothy Roscoe

  4. THE BIG PICTURE: UNDERSTANDING THE DATACENTER (figure labels: Strymon, Enterprise Datacenter, event logs) ‣ The volume of datacenter logs is huge ‣ Keeping archives is not a viable solution ‣ We can process logs online

  5. THE BIG PICTURE: UNDERSTANDING THE DATACENTER (figure labels: Strymon, Enterprise Datacenter, event logs) Strymon is a novel system able to: ‣ Perform deep analytics on thousands of distributed streams of event logs in parallel ‣ Explain its outputs interactively

  6. IDEAS IN STRYMON CAN BE GENERALIZED for dataflow systems (streaming analytics, iterative analytics) and different execution models (synchronous vs asynchronous, shared-nothing vs shared-memory). Figure labels: input stream, output stream, worker 1, worker 2

  7. TIMELY DATAFLOW: D. Murray, F. McSherry, M. Isard, R. Isaacs, P. Barham, M. Abadi. Naiad: A Timely Dataflow System. In SOSP, 2013. ▸ A streaming framework for data-parallel computations ▸ Cyclic dataflows ▸ Logical timestamps (epochs) ▸ Asynchronous execution ▸ Low latency. DIFFERENTIAL DATAFLOW: F. McSherry, D. Murray, R. Isaacs, M. Isard. Differential Dataflow. In CIDR, 2013. ▸ A high-level API on top of Timely Dataflow ▸ Incremental computation
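
The "incremental computation" bullet can be illustrated with a minimal sketch in plain Python. This is not the Timely or Differential Dataflow API; the class name `IncrementalCount`, the `(key, delta)` input format, and the `((key, count), delta)` output format are assumptions made for illustration. The idea shown is that each epoch supplies only the changes to the input, and the operator emits only the changes to its output instead of recomputing everything.

```python
from collections import Counter


class IncrementalCount:
    """Hypothetical operator: keeps per-key counts, emits output changes only."""

    def __init__(self):
        self.counts = Counter()

    def apply(self, diffs):
        """diffs: iterable of (key, delta) input changes for one epoch.

        Returns a list of ((key, count), delta) output changes:
        -1 retracts an old output record, +1 asserts a new one.
        """
        touched = Counter()
        for key, delta in diffs:
            touched[key] += delta

        changes = []
        for key, delta in touched.items():
            if delta == 0:
                continue  # additions and removals cancelled out
            old = self.counts[key]
            new = old + delta
            if old > 0:
                changes.append(((key, old), -1))
            if new > 0:
                changes.append(((key, new), +1))
            if new > 0:
                self.counts[key] = new
            else:
                del self.counts[key]
        return changes


op = IncrementalCount()
epoch0 = op.apply([("a", +1), ("b", +1), ("a", +1)])
epoch1 = op.apply([("a", -1)])  # only key "a" is touched in this epoch
print(epoch0)
print(epoch1)
```

In the second epoch only the record for key `"a"` is retracted and reasserted; the record for `"b"` is never revisited.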

  8. PART I: Why is this record in the output of my distributed dataflow?

  9. EXPLANATIONS IN DATABASES (figure labels: COMPUTATION with steps 1, 2, 3; PROVENANCE)

  10. THE PROBLEM: OUTPUT EXPLANATION (figure labels: OUTPUT, INPUT)

  11. THE PROBLEM: OUTPUT EXPLANATION. THIS RECORD LOOKS WRONG! OUTPUT: {App 115 344}, {VM 233 -22}, {App 100 55}, {VM 333 -124}, … INPUT: {A 115 344}, {F 233 122}, {W 100 -95}, {V 30 23}, …

  12. THE PROBLEM: OUTPUT EXPLANATION. THIS RECORD LOOKS WRONG! OUTPUT: {App 115 344}, {VM 233 -22}, {App 100 55}, {VM 333 -124}, … INPUT: {A 115 344}, {F 233 122}, {W 100 -95}, {V 30 23}, …

  13. THE PROBLEM: OUTPUT EXPLANATION. THIS RECORD LOOKS WRONG! OUTPUT: {App 115 344}, {VM 233 -22}, {App 100 55}, {VM 333 -124}, … INPUT: {A 115 344}, {F 233 122}, {W 100 -95}, {V 30 23}, … Output explanation: A subset of the input that is sufficient to reproduce the selected subset of the output

  14. ANNOTATION-BASED TECHNIQUES (figure: metadata propagation through steps 1, 2, 3) ▸ Fast ▸ Explode in size

  15. INVERSION-BASED TECHNIQUES (figure: inverted steps 1’, 2’, 3’) ▸ Small memory footprint ▸ Not generally applicable

  16. BACKWARD TRACING (figure: dependencies across steps 1, 2, 3) ▸ Small memory footprint ▸ Generally applicable ▸ Fast

  17. PROBLEM 1: TOO MUCH INFORMATION. Use Case: Graph Reachability (figure: a graph with nodes 1, 2, 3, 4, 5)

  18. PROBLEM 1: TOO MUCH INFORMATION. Use Case: Graph Reachability (figure: a graph with nodes 1, 2, 3, 4, 5). WHY IS (1,3) IN THE OUTPUT? ▸ Record (1,3) appears in the result

  19. PROBLEM 1: TOO MUCH INFORMATION. Use Case: Graph Reachability (figure: a graph with nodes 1, 2, 3, 4, 5). WHY IS (1,3) IN THE OUTPUT? ▸ Record (1,3) appears in the result ▸ Naive backward tracing returns all edges of the graph as an explanation

  20. PROBLEM 1: TOO MUCH INFORMATION. Use Case: Graph Reachability (figure: a graph with nodes 1, 2, 3, 4, 5). WHY IS (1,3) IN THE OUTPUT? ▸ Record (1,3) appears in the result ▸ Naive backward tracing returns all edges of the graph as an explanation ▸ A shortest path suffices
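
The "a shortest path suffices" point can be sketched with a breadth-first search in plain Python. The slides do not list the edges of the five-node graph, so the edge set `EDGES` below is hypothetical, as is the function name `shortest_path_edges`. The sketch only shows that a small subset of the edges is enough to reproduce the output record (1,3).

```python
from collections import deque


def shortest_path_edges(edges, src, dst):
    """Return the edges of one shortest directed path from src to dst.

    Returns an empty list if dst is unreachable (or if src == dst).
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)

    parent = {src: None}  # also serves as the visited set
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adj.get(u, []):
            if v not in parent:
                parent[v] = u
                queue.append(v)

    if dst not in parent:
        return []

    path = []
    node = dst
    while parent[node] is not None:
        path.append((parent[node], node))
        node = parent[node]
    path.reverse()
    return path


# Hypothetical edges over nodes 1..5; the slide shows the nodes only.
EDGES = [(1, 2), (2, 3), (1, 4), (4, 5), (5, 3)]

explanation = shortest_path_edges(EDGES, 1, 3)
print(explanation)  # a subset of EDGES that still makes 3 reachable from 1
```

With this edge set, the explanation has two edges, while naive backward tracing would return all five.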

  21. PROBLEM 2: NOT ENOUGH INFORMATION. Use Case: Word Set Difference. Doc A: THE QUICK BROWN FOX … Doc B: THE LAZY DOG …

  22. PROBLEM 2: NOT ENOUGH INFORMATION. Use Case: Word Set Difference. Doc A: THE QUICK BROWN FOX … Doc B: THE LAZY DOG … Output: (doc A, 3 unique words), (doc B, 2 unique words). WHY ARE ONLY 3 WORDS UNIQUE TO DOCUMENT A? ▸ Record (doc A, 3 unique words) appears in the result

  23. PROBLEM 2: NOT ENOUGH INFORMATION. Use Case: Word Set Difference. Doc A: THE QUICK BROWN FOX … Doc B: THE LAZY DOG … Output: (doc A, 3 unique words), (doc B, 2 unique words). WHY ARE ONLY 3 WORDS UNIQUE TO DOCUMENT A? ▸ Record (doc A, 3 unique words) appears in the result ▸ Naive backward tracing returns only the words of doc A as an explanation

  24. PROBLEM 2: NOT ENOUGH INFORMATION. Use Case: Word Set Difference. Doc A: THE QUICK BROWN FOX … Doc B: THE LAZY DOG … Output: (doc A, 3 unique words), (doc B, 2 unique words). WHY ARE ONLY 3 WORDS UNIQUE TO DOCUMENT A? ▸ Record (doc A, 3 unique words) appears in the result ▸ Naive backward tracing returns only the words of doc A as an explanation ▸ We also need the words of doc B to reproduce the record (doc A, 3 unique words)
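
Why the naive explanation is not enough can be checked with a few lines of plain Python. The function name `unique_word_counts` is an assumption, and the documents contain only the words visible on the slide (the elided "…" parts are left out). The sketch replays the computation on doc A alone and compares the result with the original output.

```python
def unique_word_counts(docs):
    """docs: {doc_id: list of words}.

    Returns {doc_id: number of its words that appear in no other document}.
    """
    result = {}
    for doc_id, words in docs.items():
        others = set()
        for other_id, other_words in docs.items():
            if other_id != doc_id:
                others.update(other_words)
        result[doc_id] = len(set(words) - others)
    return result


# Only the words shown on the slide; the elided parts are omitted.
DOCS = {
    "A": ["THE", "QUICK", "BROWN", "FOX"],
    "B": ["THE", "LAZY", "DOG"],
}

original = unique_word_counts(DOCS)
replay_on_a_only = unique_word_counts({"A": DOCS["A"]})

print(original)          # full input
print(replay_on_a_only)  # naive explanation: doc A alone
```

Replayed on doc A alone, THE is no longer filtered out as a shared word, so the count for A diverges from the original record.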

  25. CAN WE SOLVE BOTH PROBLEMS? Yes! Given that the system is able to: ▸ Keep track of the exact point in the computation at which a data record was produced ▸ Detect divergent records when replaying the computation on a subset of the input. We exploit the main features of Differential Dataflow

  26. EXPLANATIONS WITH DIFFERENTIAL DATAFLOW (figure: original dataflow from INPUT to OUTPUT through operators Op A, Op B, Op C)

  27. EXPLANATIONS WITH DIFFERENTIAL DATAFLOW (figure: original dataflow from INPUT to OUTPUT through operators Op A, Op B, Op C; explanation dataflow from INPUT to OUTPUT through Join operators). Augment the original dataflow with a shadow dataflow

  28. ITERATIVE BACKWARD TRACING (figure: original dataflow from INPUT to OUTPUT through operators Op A, Op B, Op C; explanation dataflow of Join operators driven by EXPL QUERY)

  29. ITERATIVE BACKWARD TRACING (figure: original dataflow and explanation dataflow as in slide 28). Trace Backwards

  30. ITERATIVE BACKWARD TRACING (figure: original dataflow and explanation dataflow as in slide 28). Replay, Compare

  31. ITERATIVE BACKWARD TRACING (figure: original dataflow and explanation dataflow as in slide 28; two record sets that differ at key k2: k1 v, k2 v’, … and k1 v, k2 v’’, …). Trace divergent records backwards

  32. ITERATIVE BACKWARD TRACING (figure: original dataflow and explanation dataflow as in slide 28). Replay again (for the new records), Compare. Repeat until a fix-point

  33. EXAMPLE: EXPLAINING OUTPUTS OF WORD SET DIFFERENCE. Doc A: THE QUICK BROWN FOX … Doc B: THE LAZY DOG …

  34. EXAMPLE: EXPLAINING OUTPUTS OF WORD SET DIFFERENCE. Doc A: THE QUICK BROWN FOX … MAP: (THE, A), (BROWN, A), (FOX, A). Doc B: THE LAZY DOG … MAP: (THE, B), (LAZY, B), (DOG, B)

  35. EXAMPLE: EXPLAINING OUTPUTS OF WORD SET DIFFERENCE. Doc A: THE QUICK BROWN FOX … MAP: (THE, A), (BROWN, A), (FOX, A). Doc B: THE LAZY DOG … MAP: (THE, B), (LAZY, B), (DOG, B). GROUP: (THE, [A,B]), (BROWN, A), (FOX, A), (LAZY, B), (DOG, B)

  36. EXAMPLE: EXPLAINING OUTPUTS OF WORD SET DIFFERENCE. Doc A: THE QUICK BROWN FOX … MAP: (THE, A), (BROWN, A), (FOX, A). Doc B: THE LAZY DOG … MAP: (THE, B), (LAZY, B), (DOG, B). GROUP: (THE, [A,B]), (BROWN, A), (FOX, A), (LAZY, B), (DOG, B). FILTER: (BROWN, A), (FOX, A), (LAZY, B), (DOG, B)

  37. EXAMPLE: EXPLAINING OUTPUTS OF WORD SET DIFFERENCE. Doc A: THE QUICK BROWN FOX … MAP: (THE, A), (BROWN, A), (FOX, A). Doc B: THE LAZY DOG … MAP: (THE, B), (LAZY, B), (DOG, B). GROUP: (THE, [A,B]), (BROWN, A), (FOX, A), (LAZY, B), (DOG, B). FILTER: (BROWN, A), (FOX, A), (LAZY, B), (DOG, B). GROUP: (A, 3), (B, 2)

  38. EXAMPLE: EXPLAINING OUTPUTS OF WORD SET DIFFERENCE. Doc A: THE QUICK BROWN FOX … MAP: (THE, A), (BROWN, A), (FOX, A). Doc B: THE LAZY DOG … MAP: (THE, B), (LAZY, B), (DOG, B). GROUP: (THE, [A,B]), (BROWN, A), (FOX, A), (LAZY, B), (DOG, B). FILTER: (BROWN, A), (FOX, A), (LAZY, B), (DOG, B). GROUP: (A, 3), (B, 2)

  39. EXAMPLE: EXPLAINING OUTPUTS OF WORD SET DIFFERENCE. Doc A: THE QUICK BROWN FOX … MAP: (THE, A), (BROWN, A), (FOX, A). Doc B: THE LAZY DOG … MAP: (THE, B), (LAZY, B), (DOG, B). GROUP: (THE, [A,B]), (BROWN, A), (FOX, A), (LAZY, B), (DOG, B). FILTER: (BROWN, A), (FOX, A), (LAZY, B), (DOG, B). GROUP: (A, 3), (B, 2). An additional (THE, A) record is shown

  40. EXAMPLE: EXPLAINING OUTPUTS OF WORD SET DIFFERENCE. Doc A: THE QUICK BROWN FOX … MAP: (THE, A), (BROWN, A), (FOX, A). Doc B: THE LAZY DOG … MAP: (THE, B), (LAZY, B), (DOG, B). GROUP: (THE, [A,B]), (BROWN, A), (FOX, A), (LAZY, B), (DOG, B). FILTER: (BROWN, A), (FOX, A), (LAZY, B), (DOG, B). GROUP: (A, 3), (B, 2). An additional (THE, A) record is shown
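
The trace, replay, compare loop can be sketched for this one pipeline in plain Python. This is a toy model, not the Differential Dataflow implementation from the talk: the function names `run` and `explain` are assumptions, input records are whole documents, and document C is a hypothetical addition that shows an input left out of the explanation. The result is sufficient to reproduce the queried output but is not claimed to be minimal.

```python
def run(docs):
    """MAP -> GROUP -> FILTER -> GROUP over {doc_id: text}.

    Returns the intermediate grouping (word -> set of doc ids) and the
    final output (doc id -> number of words unique to that document).
    """
    groups = {}
    for doc_id, text in docs.items():          # MAP: emit (word, doc)
        for word in text.split():
            groups.setdefault(word, set()).add(doc_id)  # GROUP by word
    counts = {}
    for word, ids in groups.items():
        if len(ids) == 1:                       # FILTER: word in one doc only
            (doc_id,) = ids
            counts[doc_id] = counts.get(doc_id, 0) + 1  # GROUP by doc
    return groups, counts


def explain(docs, target):
    """Return doc ids sufficient to reproduce the output record of `target`."""
    orig_groups, _ = run(docs)
    explanation = {target}                      # naive backward trace
    while True:
        replay_groups, _ = run({d: docs[d] for d in explanation})  # replay
        needed = set()
        for word in docs[target].split():       # records on the target's trace
            if replay_groups[word] != orig_groups[word]:  # divergent record
                needed |= orig_groups[word]     # trace it back in the original run
        new = needed - explanation
        if not new:                             # fix-point: nothing diverges
            return explanation
        explanation |= new


DOCS = {
    "A": "THE QUICK BROWN FOX",
    "B": "THE LAZY DOG",
    "C": "SLEEPY CAT",  # hypothetical extra document, not on the slides
}

expl = explain(DOCS, "A")
_, original_counts = run(DOCS)
_, replayed_counts = run({d: DOCS[d] for d in expl})
print(sorted(expl), original_counts["A"], replayed_counts["A"])
```

The first replay on doc A alone lets THE pass the filter, which diverges from the original run; tracing that record back pulls in doc B, after which the replay matches and the loop stops.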

  41. RESULTS: EXPLAINING CONNECTED COMPONENTS ▸ Dataset: A subset of the Twitter graph with 1B edges ▸ Algorithm: Label propagation ▸ Output: Records of the form (A,B) denoting that nodes A and B belong to the same connected component ▸ System used: Differential Dataflow ▸ Machine used: Intel Xeon E5-4640 at 2.4 GHz with 32 cores and 500 GB of RAM. More results: Z. Chothia, J. Liagouris, F. McSherry, T. Roscoe. Explaining Outputs in Modern Data Analytics. PVLDB 9(12):1137-1148, 2016.
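
Label propagation for connected components can be sketched sequentially in plain Python. This is not the Differential Dataflow program used in the experiments; the function names, the tiny edge list, and the choice to report each unordered pair once are assumptions for illustration. Every node starts with its own id as label, and labels are lowered along edges until nothing changes.

```python
def connected_components(edges):
    """Label propagation on an undirected graph given as (u, v) pairs.

    Returns {node: label}, where the label is the smallest node id in
    the node's connected component.
    """
    labels = {}
    for u, v in edges:
        labels.setdefault(u, u)
        labels.setdefault(v, v)

    changed = True
    while changed:                      # iterate until a fix-point
        changed = False
        for u, v in edges:
            low = min(labels[u], labels[v])
            if labels[u] != low:
                labels[u] = low
                changed = True
            if labels[v] != low:
                labels[v] = low
                changed = True
    return labels


def same_component_pairs(labels):
    """Records (A, B), with A < B, for nodes in the same component."""
    nodes = sorted(labels)
    return [
        (a, b)
        for i, a in enumerate(nodes)
        for b in nodes[i + 1:]
        if labels[a] == labels[b]
    ]


# Hypothetical toy graph; the talk used a subset of the Twitter graph.
EDGES = [(1, 2), (2, 3), (4, 5)]

labels = connected_components(EDGES)
pairs = same_component_pairs(labels)
print(labels)
print(pairs)
```

An explanation query on one of these pairs, for example (1,3), would then ask which input edges are needed to reproduce that record.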

  42. EXPLAINING CONNECTED COMPONENTS (figure: results on the Twitter graph)

  43. PART II: Why is my distributed dataflow slow?

  44. DISTRIBUTED DATAFLOWS (figure labels: client, scheduler, Apache Flink, W1, Naiad, W1)
