

  1. Efficient Data-Parallel Cumulative Aggregates for Large-Scale Machine Learning
     Matthias Boehm ¹, Alexandre V. Evfimievski ², Berthold Reinwald ²
     ¹ Graz University of Technology; Graz, Austria
     ² IBM Research – Almaden; San Jose, CA, USA

  2. Introduction and Motivation: Large-Scale ML
      Large-scale machine learning forms a feedback loop of data, model, and usage: a variety of ML applications (supervised, semi-/unsupervised) over large collected data (labels from feedback, weak supervision).
      State-of-the-art ML systems: batch algorithms over data-/task-parallel operations, and mini-batch algorithms over parameter servers.
      Data-parallel distributed operations: linear algebra (matrix multiplication, element-wise operations, structural and grouping aggregations, statistical functions) and meta learning (e.g., cross validation, ensembles, hyper-parameters); in practice, also reorganizations and cumulative aggregates.

  3. Introduction and Motivation: Cumulative Aggregates
      Example (prefix sums): Z = cumsum(X) with Z_ij = Σ_{k=1..i} X_kj = X_ij + Z_{(i-1)j}, e.g.,

         X = [ 1 2 ]         Z = [ 1 2 ]
             [ 1 1 ]   ==>       [ 2 3 ]
             [ 3 4 ]             [ 5 7 ]
             [ 2 1 ]             [ 7 8 ]

      Applications: #1 iterative survival analysis (Cox regression, Kaplan-Meier); #2 spatial data processing via linear algebra and cumulative histograms; #3 data preprocessing (subsampling of rows, removing empty rows).
      Parallelization in MPI: the recursive formulation looks inherently sequential, but prefix sums are a classic example for parallelization via aggregation trees across ranks (message passing or shared-memory HPC systems). [Figure: aggregation tree over per-rank partial sums.]
      Question: how to compute efficient, data-parallel cumulative aggregates over blocked matrices represented as unordered collections in Spark or Flink?
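      As a concrete check of the definition, a minimal DML snippet (SystemML's scripting language, used throughout this deck) reproducing the example above:

         X = matrix("1 2 1 1 3 4 2 1", rows=4, cols=2);
         Z = cumsum(X);          # column-wise prefix sums
         print(toString(Z));     # rows: (1,2), (2,3), (5,7), (7,8)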

  4. Outline
      SystemML Overview and Related Work
      Data-Parallel Cumulative Aggregates
      System Integration
      Experimental Results

  5. SystemML Overview and Related Work

  6. SystemML Overview and Related Work: High-Level SystemML Architecture
      DML (Declarative Machine Learning Language) scripts with 20+ scalable algorithms; APIs: command line, JMLC, Spark MLContext, Spark ML.
      Stack of language, compiler, and runtime [SIGMOD'15,'17,'19] [PVLDB'14,'16a,'16b,'18] [ICDE'11,'12,'15] [CIDR'17] [VLDBJ'18] [DEBull'14] [PPoPP'15].
      Backends: in-memory single node (scale-up, since 2012), Hadoop (since 2010/11) or Spark (since 2015) cluster (scale-out), and an in-progress GPU backend (since 2014/16).
      Project history: 08/2015 open-source release, 11/2015 Apache Incubator project, 05/2017 Apache Top-Level project.

  7. SystemML Overview and Related Work: Basic HOP and LOP DAG Compilation
      Example LinregDS (Direct Solve):

         X = read($1);
         y = read($2);
         intercept = $3;
         lambda = 0.001;
         if (intercept == 1) {
           ones = matrix(1, nrow(X), 1);
           X = append(X, ones);
         }
         I = matrix(1, ncol(X), 1);
         A = t(X) %*% X + diag(I)*lambda;
         b = t(X) %*% y;
         beta = solve(A, b);
         write(beta, $4);

      Scenario: X: 10^8 x 10^3 (10^11 cells, 800 GB), y: 10^8 x 1 (10^8 cells, 800 MB); cluster config: 20 GB driver memory, 60 GB executor memory.
      The HOP DAG (after rewrites) carries size propagation and memory estimates per operator output (from a few KB for t(X)%*%y and the 8 MB t(X)%*%X result up to 800 GB for X and 1.6 TB for intermediates), which drive execution-type selection: large operations like ba(+*) run distributed (SP), small ones like b(solve) locally (CP).
      Hybrid runtime plans: integrated CP (single-node) and Spark runtime; in the LOP DAG (after rewrites), t(X)%*%X becomes tsmm(SP) and t(X)%*%y becomes mapmm(SP), with the row-blocked X (X_{1,1}, X_{2,1}, ..., X_{m,1}) persisted in MEM_DISK and small transposes r'(CP) executed locally.
      Distributed matrices: fixed-size (squared) matrix blocks and data-parallel operations.

  8. SystemML Overview and Related Work: Cumulative Aggregates in ML Systems (Straw-Man Scripts and Built-In Support)
      Straw-man script #1, O(n^2) under copy-on-write semantics:

         cumsumN2 = function(Matrix[Double] A)
           return (Matrix[Double] B)
         {
           B = A; csums = matrix(0, 1, ncol(A));
           for (i in 1:nrow(A)) {
             csums = csums + A[i,];   # running column sums
             B[i,] = csums;
           }
         }

      Straw-man script #2, O(n log n):

         cumsumNlogN = function(Matrix[Double] A)
           return (Matrix[Double] B)
         {
           B = A; m = nrow(A); k = 1;
           while (k < m) {
             B[(k+1):m,] = B[(k+1):m,] + B[1:(m-k),];  # doubling step
             k = 2 * k;
           }
         }

      Both qualify for update in-place, but remain too slow for large inputs.
      ML systems: update in-place in R (reference counting), SystemML (rewrites), Julia; built-ins cumsum(), cummin(), cummax(), cumprod() in R, MATLAB, Julia, NumPy, and SystemML (since 2014).
      SQL: SELECT Rid, V, sum(V) OVER (ORDER BY Rid) AS cumsum FROM X; window functions with sequential and parallelized execution (e.g., [Leis et al., PVLDB'15]).
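      A small driver script to compare the straw-men against the built-in (a sketch, assuming the two functions above are defined in the same script; sizes illustrative):

         X = rand(rows=1000, cols=10);
         R1 = cumsumN2(X);
         R2 = cumsumNlogN(X);
         R3 = cumsum(X);                # built-in
         print(sum(abs(R1 - R3)));      # ~0
         print(sum(abs(R2 - R3)));      # ~0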

  9. Data-Parallel Cumulative Aggregates

  10. Data-Parallel Cumulative Aggregates: DistCumAgg Framework
      Basic idea: a self-similar chain of data-parallel operators: a forward pass computes block-local aggregates, a local cumulative aggregate is applied over these aggregates of aggregates (recursively, if they still exceed a single block), and a backward pass merges the resulting offsets into block-local cumulative aggregates; see the sketch below.
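      A minimal sequential DML emulation of this pattern for cumsum over blocks of b rows (cumsumBlocked is an illustrative name, not a SystemML primitive; the actual runtime executes the forward and backward passes as data-parallel Spark operations over blocked matrices):

         cumsumBlocked = function(Matrix[Double] X, Integer b)
           return (Matrix[Double] Z)
         {
           m = nrow(X);
           nb = as.integer(ceil(m / b));
           # forward: block-local column aggregates
           A = matrix(0, nb, ncol(X));
           for (i in 1:nb) {
             A[i,] = colSums(X[((i-1)*b+1):min(i*b, m),]);
           }
           # local: cumulative aggregate over the aggregates of aggregates
           C = cumsum(A);
           # backward: offset each block by the sums of all preceding blocks
           Z = matrix(0, m, ncol(X));
           for (i in 1:nb) {
             Zi = cumsum(X[((i-1)*b+1):min(i*b, m),]);
             if (i > 1) {
               Zi = Zi + C[i-1,];    # row-vector offset, broadcast over the block
             }
             Z[((i-1)*b+1):min(i*b, m),] = Zi;
           }
         }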

  11. Data-Parallel Cumulative Aggregates: Basic Cumulative Aggregates
      Instantiating basic cumulative aggregates:

         Operation  | init | f_agg       | f_off                 | f_cumagg
         cumsum(X)  |  0   | colSums(B)  | B[1,] = B[1,] + a     | cumsum(B)
         cummin(X)  |  ∞   | colMins(B)  | B[1,] = min(B[1,], a) | cummin(B)
         cummax(X)  | -∞   | colMaxs(B)  | B[1,] = max(B[1,], a) | cummax(B)
         cumprod(X) |  1   | colProds(B) | B[1,] = B[1,] * a     | cumprod(B)

      Example cumsum(X): f_off and f_cumagg are fused to avoid an extra copy; a fragment illustrating this follows below.
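      To make f_off concrete for cummin, a self-contained DML fragment applying one offset row a (the running column minimum of all preceding blocks; B and a hold illustrative values):

         B = matrix("5 3 4 2 6 1", rows=3, cols=2);
         a = matrix("4 1", rows=1, cols=2);
         B[1,] = min(B[1,], a);    # f_off: adjust only the first row
         Z = cummin(B);            # f_cumagg over the adjusted block
         print(toString(Z));       # rows: (4,1), (4,1), (4,1)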

  12. Data-Parallel Cumulative Aggregates: Complex Cumulative Aggregates
      Instantiating Z = cumsumprod(X) = cumsumprod(Y, W) with Z_i = Y_i + W_i * Z_{i-1}, Z_0 = 0, where X = cbind(Y, W), e.g., X rows (1, .2), (1, .1), (3, .0), (2, .1); applications include exponential smoothing and linear recurrence equations.
      Instantiation: init 0; f_agg = cbind(cumsumprod(B)_{n,1}, prod(B_{:,2})); f_off: B_{1,1} = B_{1,1} + B_{1,2} * a; f_cumagg = cumsumprod(B).
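      For instance, simple exponential smoothing S_i = alpha * x_i + (1-alpha) * S_{i-1} maps directly onto cumsumprod; a sketch with illustrative data (note Z_0 = 0, so S_1 = alpha * x_1 under this initialization):

         alpha = 0.2;
         x = rand(rows=1000, cols=1);
         Y = alpha * x;
         W = matrix(1 - alpha, nrow(x), 1);
         S = cumsumprod(cbind(Y, W));   # S_i = alpha*x_i + (1-alpha)*S_{i-1}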

  13. System Integration

  14. System Integration: Simplification Rewrites
      Example #1: Suffix Sums. Problem: the distributed rev causes data shuffling. Instead, compute suffix sums via column aggregates and prefix sums:

         rev(cumsum(rev(X)))  ==>  X + colSums(X) - cumsum(X)

      because colSums(X) - cumsum(X) yields, per row, the sum of all following rows; colSums(X) is a broadcast and the remaining operations are partitioning-preserving. A validation snippet follows below.
      Example #2: Extract Lower Triangular. Problem: indexing is cumbersome/slow, and cumsum is densifying; the column-wise cumsum of the identity matrix yields the lower-triangular matrix of ones. Use a dedicated operator instead:

         X * cumsum(diag(matrix(1, nrow(X), 1)))  ==>  lower.tri(X)
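      The suffix-sum rewrite is easy to sanity-check in a DML script (sizes illustrative):

         X = rand(rows=1000, cols=10);
         S1 = rev(cumsum(rev(X)));          # suffix sums via reversals (shuffle-heavy)
         S2 = X + colSums(X) - cumsum(X);   # rewritten, partitioning-preserving form
         print(sum(abs(S1 - S2)));          # ~0 up to floating-point round-off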

  15. System Integration: Execution Plan Generation
      Compilation chain of cumulative aggregates: execution-type selection based on memory estimates, and physical operator configuration (broadcast, aggregation, in-place, #threads).
      Example: for nrow(X) ≥ b, the high-level operator (HOP) u(cumsum) compiles into a chain of low-level operators (LOPs): a distributed forward cumagg (SP, k+), a local cumsum (CP, k+, 24 threads, in-place), and a distributed backward offset cumagg (SP, k+, broadcast). The corresponding runtime instructions:

         ...
         SP ucumack+ _mVar1 _mVar2
         CP ucumk+ _mVar2 _mVar3 24 T
         CP rmvar _mVar2
         SP bcumoffk+ _mVar1 _mVar3 _mVar4 0 T
         CP rmvar _mVar1 _mVar3
         ...
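      For instance, a script like the following (sizes and output path illustrative) would exceed the 20 GB driver budget and would likely compile cumsum into such an SP-CP-SP chain:

         X = rand(rows=100000000, cols=100);   # ~80 GB dense, forces Spark execution
         Z = cumsum(X);
         write(Z, "hdfs:/tmp/Z", format="binary");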
