Enabling Operator Reordering in Data Flow Programs Through Static Code Analysis

Enabling Operator Reordering in Data Flow Programs Through Static Code Analysis. XLDI 2012. Fabian Hueske, Aljoscha Krettek, Kostas Tzoumas. Database Systems and Information Management, Technische Universität Berlin.


  1. Enabling Operator Reordering in Data Flow Programs Through Static Code Analysis
     XLDI 2012
     Fabian Hueske, Aljoscha Krettek, Kostas Tzoumas
     Database Systems and Information Management, Technische Universität Berlin
     aljoscha.krettek@campus.tu-berlin.de
     September 9th, 2012

  2. Agenda
     ◮ Motivation
     ◮ Operator Reordering
     ◮ Static Code Analysis
     ◮ Conclusion

  3. Motivation: Big Data Analytics
     ◮ “Big Data” revolution
       ◮ Huge amounts of machine- and human-generated data, often semi-structured
       ◮ Need for “deep” analytics beyond simple BI queries
     ◮ Breed of new parallel data management systems
       ◮ Hadoop, Stratosphere, Asterix, SCOPE, etc.
     ◮ Common themes in programming models
       ◮ Data flows composed (in part) of functions written in arbitrary imperative code
       ◮ Also seen in modern MPP SQL systems (Greenplum, Aster)
       ◮ Allows more powerful analytics on diverse data sets

  4. Stratosphere
     [Figure: Stratosphere system overview. Labels: Compiler; Scientific Data, Life Sciences, Linked Data, ...; Query Processor; PACT Optimizer; Nephele.]
     Example query shown in the figure:
         $res = filter $e in $emp where $e.income > 30000;

  5. The PACT Programming Model
     ◮ Generalization of MapReduce
     ◮ Data flow consisting of data sources, sinks, and operators
     ◮ Operators consist of
       ◮ Second-order function (SOF) signature from a fixed set of system-defined SOFs - PArallelization ConTracts
       ◮ First-order function (FOF) written by programmer in Java
     ◮ Intermediate representation, but also exposed to the user
       ◮ E.g., to implement functionality not supported by query language
     [Figure: example data flow]
         Src 1 [A, B] → Map(f1): C ← A + B → [A, B, C]
         Src 2 [D, E] → Map(f2): filter(E > 3) → [D, E]
         both → Match(f3, A, D) → [A, B, C, D, E] → Reduce(f4, A): sum(B) → [A, B] → Sink 1
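
A minimal sketch of the example data flow on this slide, written in plain Java rather than against the Stratosphere API. The class PactSketch, the record type Rec, and all method signatures are hypothetical, and the body of f3 (concatenating the two matched records) is an assumption that is merely consistent with the output schema [A, B, C, D, E]. It only illustrates the split between system-defined second-order functions (map, match, reduce) and user-written first-order functions (the lambdas).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

class PactSketch {
    /** Hypothetical record type: attribute name -> integer value. */
    static class Rec extends TreeMap<String, Integer> {
        Rec with(String attr, int value) { put(attr, value); return this; }
    }

    /** Map: calls the FOF once per record; the FOF may emit zero or more records. */
    static List<Rec> map(List<Rec> in, Function<Rec, List<Rec>> fof) {
        List<Rec> out = new ArrayList<>();
        for (Rec r : in) out.addAll(fof.apply(r));
        return out;
    }

    /** Match: calls the FOF once per pair of records whose key values are equal. */
    static List<Rec> match(List<Rec> left, String leftKey, List<Rec> right, String rightKey,
                           BiFunction<Rec, Rec, Rec> fof) {
        List<Rec> out = new ArrayList<>();
        for (Rec l : left)
            for (Rec r : right)
                if (l.get(leftKey).equals(r.get(rightKey))) out.add(fof.apply(l, r));
        return out;
    }

    /** Reduce: calls the FOF once per group of records that share a key value. */
    static List<Rec> reduce(List<Rec> in, String key, Function<List<Rec>, Rec> fof) {
        Map<Integer, List<Rec>> groups = new TreeMap<>();
        for (Rec r : in) groups.computeIfAbsent(r.get(key), k -> new ArrayList<>()).add(r);
        List<Rec> out = new ArrayList<>();
        for (List<Rec> g : groups.values()) out.add(fof.apply(g));
        return out;
    }

    /** Wires the operators together in the order shown on the slide. */
    static List<Rec> run(List<Rec> src1, List<Rec> src2) {
        // f1: C <- A + B
        List<Rec> m1 = map(src1, r -> {
            Rec o = new Rec();
            o.putAll(r);
            o.put("C", r.get("A") + r.get("B"));
            return Collections.singletonList(o);
        });
        // f2: filter(E > 3)
        List<Rec> m2 = map(src2, r -> r.get("E") > 3
                ? Collections.singletonList(r) : Collections.<Rec>emptyList());
        // f3 (assumed): concatenate the matched records
        List<Rec> joined = match(m1, "A", m2, "D", (l, r) -> {
            Rec o = new Rec();
            o.putAll(l);
            o.putAll(r);
            return o;
        });
        // f4: sum(B) per value of A
        return reduce(joined, "A", g -> {
            int sum = 0;
            for (Rec r : g) sum += r.get("B");
            return new Rec().with("A", g.get(0).get("A")).with("B", sum);
        });
    }

    static List<Rec> demo() {
        List<Rec> src1 = Arrays.asList(new Rec().with("A", 1).with("B", 2),
                new Rec().with("A", 1).with("B", 5), new Rec().with("A", 2).with("B", 7));
        List<Rec> src2 = Arrays.asList(new Rec().with("D", 1).with("E", 4),
                new Rec().with("D", 2).with("E", 1));
        return run(src1, src2);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [{A=1, B=7}]
    }
}
```

The nested-loop match is the simplest correct local implementation; the point of the SOF signature is that the system may swap in a partitioned or broadcast strategy without touching the lambdas.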

  6. Automatic Parallelization
     ◮ Knowledge of PACT signature permits automatic parallelization
       ◮ E.g., for Match operator
         ◮ Choice of broadcast, partition, SFR, etc.
         ◮ Sort-merge or hash-based physical implementation
     ◮ Cascades-style optimizer
       ◮ Partitioning strategies propagated top-down as interesting properties
     [Figure: the data flow of the previous slide annotated with execution strategies. Labels: fifo; partition/sort(A); broadcast; probeHT; buildHT.]
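
A rough sketch of one strategy choice named on this slide for Match: hash-partition both inputs on the key, then run a hash-based build and probe locally in each partition. The class and method names are hypothetical, and records are simplified to {key, value} integer pairs; a broadcast or sort-merge variant would be structured differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PartitionedHashMatch {
    /** Hash-partitions records by their key (first array element) into n buckets. */
    static List<List<int[]>> partition(List<int[]> records, int n) {
        List<List<int[]>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) parts.add(new ArrayList<>());
        for (int[] r : records) parts.get(Math.floorMod(Integer.hashCode(r[0]), n)).add(r);
        return parts;
    }

    /** Local hash join: builds a table on one side and probes it with the other. */
    static List<int[]> buildAndProbe(List<int[]> build, List<int[]> probe) {
        Map<Integer, List<int[]>> table = new HashMap<>();
        for (int[] b : build) table.computeIfAbsent(b[0], k -> new ArrayList<>()).add(b);
        List<int[]> out = new ArrayList<>();
        for (int[] p : probe) {
            List<int[]> matches = table.get(p[0]);
            if (matches == null) continue;
            // Emit {key, probe-side value, build-side value} for every matching pair.
            for (int[] b : matches) out.add(new int[] { p[0], p[1], b[1] });
        }
        return out;
    }

    /** Partitions both inputs on the key, then joins each pair of partitions independently. */
    static List<int[]> match(List<int[]> left, List<int[]> right, int parallelism) {
        List<List<int[]>> l = partition(left, parallelism);
        List<List<int[]>> r = partition(right, parallelism);
        List<int[]> out = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) out.addAll(buildAndProbe(r.get(i), l.get(i)));
        return out;
    }

    /** Number of matches on a small fixed input for the given degree of parallelism. */
    static int demo(int parallelism) {
        List<int[]> left = Arrays.asList(new int[] {1, 10}, new int[] {2, 20}, new int[] {3, 30});
        List<int[]> right = Arrays.asList(new int[] {1, 100}, new int[] {3, 300}, new int[] {3, 301});
        return match(left, right, parallelism).size();
    }

    public static void main(String[] args) {
        System.out.println(demo(1)); // 3
        System.out.println(demo(4)); // 3
    }
}
```

Because equal keys always land in the same partition, the result does not depend on the degree of parallelism; each loop iteration in match could run on a separate node.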

  7. Need for Operator Reordering
     ◮ Operator reordering may reduce the amount of intermediate data
     ◮ May introduce new opportunities for parallelization strategies
     ◮ For optimal execution, need to consider operator order, parallelization strategies, and physical execution in one step
     ◮ SOF signature not enough - need to look inside FOF
     [Figure: reordered data flow]
         Src 1 [A, B] → Map(f1): C ← A + B → Reduce(f4, A): sum(B)
         Src 2 [D, E] → Map(f2): filter(E > 3)
         both → Match(f3, A, D) → Sink 1

  8. Experimental Results
     [Figure: bar chart of runtime in seconds (axis from 0 to 12000), best order vs. worst order, for TPC-H Q7, Clickstream Processing, and Textmining. Bars annotated x10.0, x1.8, and x7.1.]

  9. Reordering Conditions
     We can reorder operators when we know some specific properties of the user-defined code.¹
     Define:
     ◮ Read set: attributes that might influence the FOF's output
     ◮ Write set: attributes that might have a different value after application of the FOF
     Example, Map-Map reordering:
     ◮ Two Map operators can be reordered if the FOFs operate on distinct values or have only read-read conflicts
     It is too cumbersome to ask the programmer to specify read and write sets; therefore we want to estimate them using static code analysis on generic FOFs.
     ¹ Opening the Black Boxes in Data Flow Optimization (VLDB 2012)
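
A minimal sketch of the Map-Map condition above, assuming each FOF is summarized by a read set and a write set of attribute names. The class FofProperties and its methods are hypothetical; the check covers only the condition stated on this slide (no conflict other than read-read).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical holder for the read and write sets of one first-order function (FOF). */
class FofProperties {
    final Set<String> readSet;
    final Set<String> writeSet;

    FofProperties(Set<String> readSet, Set<String> writeSet) {
        this.readSet = readSet;
        this.writeSet = writeSet;
    }

    static Set<String> attrs(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    /** True if neither FOF writes an attribute that the other one reads or writes. */
    static boolean canReorderMaps(FofProperties f, FofProperties g) {
        return Collections.disjoint(f.writeSet, g.readSet)
            && Collections.disjoint(f.readSet, g.writeSet)
            && Collections.disjoint(f.writeSet, g.writeSet);
    }

    public static void main(String[] args) {
        // f1 computes C <- A + B: it reads A and B and writes C.
        FofProperties f1 = new FofProperties(attrs("A", "B"), attrs("C"));
        // A filter on A only reads A, so the only shared access is read-read.
        FofProperties filterA = new FofProperties(attrs("A"), attrs());
        // A filter on C reads the attribute that f1 writes.
        FofProperties filterC = new FofProperties(attrs("C"), attrs());
        System.out.println(canReorderMaps(f1, filterA)); // true
        System.out.println(canReorderMaps(f1, filterC)); // false
    }
}
```

Since the sets are defined as "might influence" and "might have a different value", over-approximating them only ever makes this check reject more reorderings, never accept an unsafe one.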

  10. Example FOF
      Fixed API for dealing with records: create, copy, get, set, setNull, and union.
      Read set is easily determined by looking at all get statements. Write set depends on the schema of the data:
      ◮ Determine four other sets: origin, write, copy, projection
      ◮ Generate final write set from these and schema information

          void match(Record left,
                     Record right,
                     Collector col) {
            Record out = copy(left);
            if (right.get(F) > 3) {
              out.set(D, right.get(D));
            } else {
              out.setNull(A);
            }
            out.set(E, right.get(E));
            out.set(F, 42);
            col.emit(out);
          }

  11. Example FOF (cont.)
      Schema: Left [A, B, C], Right [D, E, F]
      Sets for the match FOF of the previous slide:
      ◮ Origin: { 1 }
      ◮ Explicit projection (l): { A }
      ◮ Explicit copy (r): { E }
      ◮ Explicit write (l): { F }
      ◮ Explicit write (r): { }
      ◮ Final write set (l): { A, F }
      ◮ Final write set (r): { D, F }
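
The slides do not spell out how the final write sets are generated from the four sets and the schema. The sketch below is one reading that reproduces the values on this slide, under two assumptions: l and r denote the left and right input, and for an input that is not the origin of the output record every attribute that is not explicitly copied counts as written. Class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class FinalWriteSets {
    static Set<String> attrs(String... names) {
        return new TreeSet<>(Arrays.asList(names));
    }

    /** Origin input (assumed rule): only explicitly written or projected attributes change. */
    static Set<String> forOriginInput(Set<String> write, Set<String> projection) {
        Set<String> result = new TreeSet<>(write);
        result.addAll(projection);
        return result;
    }

    /** Other input (assumed rule): every attribute that is not explicitly copied may change. */
    static Set<String> forOtherInput(Set<String> schema, Set<String> copy, Set<String> write) {
        Set<String> result = new TreeSet<>(schema);
        result.removeAll(copy);
        result.addAll(write);
        return result;
    }

    /** Left input of the slide: explicit write { F }, explicit projection { A }. */
    static Set<String> demoLeft() {
        return forOriginInput(attrs("F"), attrs("A"));
    }

    /** Right input of the slide: schema [D, E, F], explicit copy { E }, explicit write { }. */
    static Set<String> demoRight() {
        return forOtherInput(attrs("D", "E", "F"), attrs("E"), attrs());
    }

    public static void main(String[] args) {
        System.out.println(demoLeft());  // [A, F]
        System.out.println(demoRight()); // [D, F]
    }
}
```

Under this reading D ends up in the right-hand write set because it is copied on only one branch of the if statement, which matches the conservative stance taken in the conclusion.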

  16. Code Analysis
      The difficult part is determining the origin, write, copy, and projection sets for a user-defined FOF from the control flow graph (CFG).
      Solution is a recursive algorithm that builds the four sets:
      ◮ Start from the emit statements and traverse the CFG upwards
      ◮ The sets at one node in the CFG depend on the sets of the predecessors and the nature of the statement
      [Figure: CFG of the example FOF]
          Record out = copy(left)
          if right.get(F) > 3
            branch 1: out.set(D, right.get(D))
            branch 2: out.setNull(A)
          out.set(E, right.get(E))
          out.set(F, 42)
          col.emit(out)

  17. Code Analysis (cont.)
      Final recursion cases:
      ◮ $or = create() → ( ∅, ∅, ∅, ∅ )
      ◮ $or = copy($ir) → ( IN($ir), ∅, ∅, ∅ )
      For other statements, merge the sets of the predecessors and then modify depending on the type of statement:
      ◮ $or.set(n, $ir.get(n)) → add n to copy set
      ◮ $or.set(n, x) → add n to write set
      ◮ $or.setNull(n) → add n to projection set
      [Figure: CFG of the example FOF, each node annotated with its sets (origin, write, copy, projection)]
          Record out = copy(left)      ( { 1 }, ∅, ∅, ∅ )
          if right.get(F) > 3          ( { 1 }, ∅, ∅, ∅ )
          out.set(D, right.get(D))     ( { 1 }, ∅, { D }, ∅ )
          out.setNull(A)               ( { 1 }, ∅, ∅, { A } )
          out.set(E, right.get(E))     ( { 1 }, ∅, { E }, { A } )
          out.set(F, 42)               ( { 1 }, { F }, { E }, { A } )
          col.emit(out)                ( { 1 }, { F }, { E }, { A } )
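
A simplified sketch of the recursion described on these two slides, run on the CFG of the example FOF. It assumes an acyclic CFG and a single output record, and it assumes a particular merge rule: copy sets are intersected and write and projection sets are unioned (this is inferred from the annotated values, where D drops out after the branches join but A is kept), and origin sets are intersected as well (a guess, since both branches carry { 1 } here). All class, enum, and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class CfgSetAnalysis {
    enum Kind { CREATE, COPY_RECORD, COPY_FIELD, WRITE_FIELD, SET_NULL, OTHER }

    /** The four sets in the order used on the slide: (origin, write, copy, projection). */
    static class Sets {
        final Set<String> origin = new TreeSet<>();
        final Set<String> write = new TreeSet<>();
        final Set<String> copy = new TreeSet<>();
        final Set<String> projection = new TreeSet<>();

        @Override public String toString() {
            return "(" + origin + ", " + write + ", " + copy + ", " + projection + ")";
        }
    }

    /** One CFG statement; arg is the input id for COPY_RECORD or the attribute name. */
    static class Node {
        final Kind kind;
        final String arg;
        final List<Node> preds = new ArrayList<>();

        Node(Kind kind, String arg, Node... preds) {
            this.kind = kind;
            this.arg = arg;
            this.preds.addAll(Arrays.asList(preds));
        }
    }

    /** Recursively computes the sets that hold after statement n (no memoization, no loops). */
    static Sets analyze(Node n) {
        Sets s = new Sets();
        if (n.kind == Kind.CREATE) return s;                             // final case: create()
        if (n.kind == Kind.COPY_RECORD) { s.origin.add(n.arg); return s; } // final case: copy($ir)
        boolean first = true;
        for (Node p : n.preds) {                                         // merge predecessor sets
            Sets ps = analyze(p);
            s.write.addAll(ps.write);
            s.projection.addAll(ps.projection);
            if (first) { s.origin.addAll(ps.origin); s.copy.addAll(ps.copy); first = false; }
            else { s.origin.retainAll(ps.origin); s.copy.retainAll(ps.copy); }
        }
        switch (n.kind) {                                                // apply the statement
            case COPY_FIELD:  s.copy.add(n.arg); break;
            case WRITE_FIELD: s.write.add(n.arg); break;
            case SET_NULL:    s.projection.add(n.arg); break;
            default:          break;
        }
        return s;
    }

    /** Builds the CFG of the example match FOF and analyzes it from the emit statement. */
    static Sets analyzeSlideExample() {
        Node copy = new Node(Kind.COPY_RECORD, "1");             // Record out = copy(left)
        Node cond = new Node(Kind.OTHER, null, copy);            // if (right.get(F) > 3)
        Node setD = new Node(Kind.COPY_FIELD, "D", cond);        // out.set(D, right.get(D))
        Node nullA = new Node(Kind.SET_NULL, "A", cond);         // out.setNull(A)
        Node setE = new Node(Kind.COPY_FIELD, "E", setD, nullA); // out.set(E, right.get(E))
        Node setF = new Node(Kind.WRITE_FIELD, "F", setE);       // out.set(F, 42)
        Node emit = new Node(Kind.OTHER, null, setF);            // col.emit(out)
        return analyze(emit);
    }

    public static void main(String[] args) {
        System.out.println(analyzeSlideExample()); // ([1], [F], [E], [A])
    }
}
```

A real analysis would also have to track which record variable each statement refers to and handle loops, for example with a fixpoint iteration; both are left out here.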

  18. Conclusion
      ◮ Reordering leads to potentially significant benefits
        ◮ Up to 10x for relational and non-relational tasks in our experiments
      ◮ Our static code analysis algorithm can automatically derive reordering properties of generic user-written Java code
        ◮ Difficulties arise in non-linear CFGs (if, loops) and also because the schema of input records changes with reordering
        ◮ Safety achieved through conservatism
      ◮ Related work: Manimal²
        ◮ Techniques are complementary
      ² Eaman Jahani, Michael J. Cafarella, Christopher Ré: Automatic Optimization for MapReduce Programs. PVLDB 4(6): 385-396 (2011)

  19. Thank you! www.stratosphere.eu (New open source release available)
