
Efficient Data-Parallel Cumulative Aggregates for Large-Scale Machine Learning
Matthias Boehm (1), Alexandre V. Evfimievski (2), Berthold Reinwald (2)
(1) Graz University of Technology; Graz, Austria
(2) IBM Research – Almaden; San Jose, CA, USA

Introduction and Motivation: Large-Scale ML

- Large-Scale Machine Learning
  - Variety of ML applications (supervised, semi-/unsupervised)
  - Large data collection (labels: feedback, weak supervision)
  - [Figure: feedback loop between data, model, and usage]
- State-of-the-art ML Systems
  - Batch algorithms: data-/task-parallel operations
  - Mini-batch algorithms: parameter server
- Data-Parallel Distributed Operations
  - Linear algebra (matrix multiplication, element-wise operations, structural and grouping aggregations, statistical functions)
  - Meta learning (e.g., cross validation, ensembles, hyper-parameters)
  - In practice: also reorganizations and cumulative aggregates

Matthias Boehm, Alexandre V. Evfimievski, and Berthold Reinwald: Efficient Data-Parallel Cumulative Aggregates for Large-Scale Machine Learning, BTW 2019

Introduction and Motivation: Cumulative Aggregates

- Example: Z = cumsum(X) with $Z_{ij} = \sum_{k=1}^{i} X_{kj} = X_{ij} + Z_{(i-1)j}$ (prefix sums)

       X         Z
      1 2       1 2
      1 1       2 3
      3 4       5 7
      2 1       7 8

- Applications
  - #1 Iterative survival analysis: Cox regression / Kaplan-Meier
  - #2 Spatial data processing via linear algebra, cumulative histograms
  - #3 Data preprocessing: subsampling of rows / removing empty rows
- Parallelization
  - Recursive formulation looks inherently sequential
  - Classic example for parallelization via aggregation trees (message passing or shared-memory HPC systems)
  - [Figure: MPI aggregation tree across ranks]
  - Question: efficient, data-parallel cumulative aggregates? (blocked matrices as unordered collections in Spark or Flink)
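The recurrence on this slide can be checked with a few lines of code. The following is a minimal sequential sketch in plain Python (not the SystemML implementation); the function name `cumsum` and the list-of-rows matrix representation are assumptions made for illustration.

```python
def cumsum(X):
    """Column-wise prefix sums: Z[i][j] = X[i][j] + Z[i-1][j]."""
    Z = []
    prev = [0] * len(X[0]) if X else []
    for row in X:
        # Each output row depends on the previous output row,
        # which is why the formulation looks inherently sequential.
        prev = [p + x for p, x in zip(prev, row)]
        Z.append(prev)
    return Z


X = [[1, 2], [1, 1], [3, 4], [2, 1]]
Z = cumsum(X)
print(Z)  # [[1, 2], [2, 3], [5, 7], [7, 8]]
```

The printed result matches the example matrix Z from the slide.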

Outline

- SystemML Overview and Related Work
- Data-Parallel Cumulative Aggregates
- System Integration
- Experimental Results

SystemML Overview and Related Work

SystemML Overview and Related Work: High-Level SystemML Architecture

- DML (Declarative Machine Learning Language) scripts: 20+ scalable algorithms
- APIs: command line, JMLC, Spark MLContext, Spark ML
- Layers: language, compiler, runtime
- Runtime backends: in-memory single node (scale-up), Hadoop or Spark cluster (scale-out), in progress: GPU (since 2014/16; since 2010/11; since 2012; since 2015)
- Milestones: 08/2015 open source release, 11/2015 Apache Incubator project, 05/2017 Apache top-level project
- Publications: [SIGMOD'15,'17,'19], [PVLDB'14,'16a,'16b,'18], [ICDE'11,'12,'15], [CIDR'17], [VLDBJ'18], [DEBull'14], [PPoPP'15]

SystemML Overview and Related Work: Basic HOP and LOP DAG Compilation

- LinregDS (Direct Solve)

    X = read($1);
    y = read($2);
    intercept = $3;
    lambda = 0.001;
    ...
    if( intercept == 1 ) {
      ones = matrix(1, nrow(X), 1);
      X = append(X, ones);
    }
    I = matrix(1, ncol(X), 1);
    A = t(X) %*% X + diag(I)*lambda;
    b = t(X) %*% y;
    beta = solve(A, b);
    ...
    write(beta, $4);

- Scenario (rows x cols, non-zeros): X: 10^8 x 10^3, 10^11; y: 10^8 x 1, 10^8
- Cluster config: driver mem 20 GB, exec mem 60 GB
- [Figure: HOP DAG (after rewrites) with per-operator memory estimates (8KB to 1.6TB) and execution types (CP/SP) for write, b(solve), b(+), r(diag), ba(+*), r(t), and dg(rand) over the inputs X and y]
- [Figure: LOP DAG (after rewrites) with tsmm(SP), mapmm(SP), and r'(CP); X is persisted in MEM_DISK as blocks X_{1,1}, X_{2,1}, ..., X_{m,1}]
- Hybrid Runtime Plans
  - Size propagation / memory estimates
  - Integrated CP / Spark runtime
- Distributed Matrices
  - Fixed-size (squared) matrix blocks
  - Data-parallel operations
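The memory estimates in the scenario can be reproduced with simple arithmetic. The sketch below is an illustration under stated assumptions, not SystemML's actual cost model: it assumes dense matrices with 8-byte cells, decimal units (1 GB = 10^9 bytes), an operation estimate equal to the sum of its input and output sizes, and a plain comparison against the 20 GB driver memory. The function names `dense_size_bytes` and `select_exec_type` are hypothetical.

```python
GB = 10**9  # decimal units, as an assumption


def dense_size_bytes(rows, cols, bytes_per_cell=8):
    """Size of a dense double-precision matrix (assumed 8 bytes per cell)."""
    return rows * cols * bytes_per_cell


def select_exec_type(mem_estimate_bytes, driver_mem_bytes):
    """Hypothetical rule: run in CP (single node) if the estimate fits, else SP (Spark)."""
    return "CP" if mem_estimate_bytes <= driver_mem_bytes else "SP"


driver_mem = 20 * GB

size_X = dense_size_bytes(10**8, 10**3)  # 800 GB
size_y = dense_size_bytes(10**8, 1)      # 800 MB
size_A = dense_size_bytes(10**3, 10**3)  # 8 MB
size_b = dense_size_bytes(10**3, 1)      # 8 KB

# Operation estimate assumed as inputs plus output.
est_transpose = size_X + size_X           # t(X): 1.6 TB
est_solve = size_A + size_b + size_b      # solve(A, b): roughly 8 MB

print(select_exec_type(est_transpose, driver_mem))  # SP
print(select_exec_type(est_solve, driver_mem))      # CP
```

Under these assumptions the large operations over X exceed the driver memory and fall to Spark, while the small operations on A and b stay on the single node, consistent with the CP/SP labels in the figure.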

SystemML Overview and Related Work: Cumulative Aggregates in ML Systems (Straw-man Scripts and Built-in Support)

- Straw-man script with copy-on-write, O(n^2):

    cumsumN2 = function(Matrix[Double] A)
      return(Matrix[Double] B)
    {
      B = A; csums = matrix(0,1,ncol(A));
      for( i in 1:nrow(A) ) {
        csums = csums + A[i,];
        B[i,] = csums;
      }
    }

- Straw-man script with O(n log n):

    cumsumNlogN = function(Matrix[Double] A)
      return(Matrix[Double] B)
    {
      B = A; m = nrow(A); k = 1;
      while( k < m ) {
        B[(k+1):m,] = B[(k+1):m,] + B[1:(m-k),];
        k = 2 * k;
      }
    }

- Both scripts qualify for update in-place, but are still too slow
- ML Systems
  - Update in-place: R (ref count), SystemML (rewrites), Julia
  - Builtins in R, Matlab, Julia, NumPy, SystemML (since 2014): cumsum(), cummin(), cummax(), cumprod()
- SQL
  - SELECT Rid, V, sum(V) OVER (ORDER BY Rid) AS cumsum FROM X
  - Sequential and parallelized execution (e.g., [Leis et al., PVLDB'15])
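To make the two straw-man scripts concrete, here is a minimal plain-Python translation. It is a sketch for illustration only: the names `cumsum_n2` and `cumsum_nlogn` and the list-of-rows representation are assumptions, and the explicit snapshot in the second function mimics the copy semantics of the DML right-hand side.

```python
def cumsum_n2(A):
    """Row-by-row running sums, mirroring cumsumN2."""
    B = []
    csums = [0] * len(A[0])
    for row in A:
        csums = [c + x for c, x in zip(csums, row)]
        B.append(csums)
    return B


def cumsum_nlogn(A):
    """Doubling scheme, mirroring cumsumNlogN: O(log m) passes over the rows."""
    B = [row[:] for row in A]
    m, k = len(B), 1
    while k < m:
        # Snapshot of rows 1..(m-k) before the update, like the DML right-hand side.
        shifted = [row[:] for row in B[:m - k]]
        for i in range(k, m):
            B[i] = [b + s for b, s in zip(B[i], shifted[i - k])]
        k *= 2
    return B


A = [[1, 2], [1, 1], [3, 4], [2, 1]]
print(cumsum_n2(A))     # [[1, 2], [2, 3], [5, 7], [7, 8]]
print(cumsum_nlogn(A))  # [[1, 2], [2, 3], [5, 7], [7, 8]]
```

Both variants compute the same prefix sums; they differ only in how many passes over the data they need.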

Data-Parallel Cumulative Aggregates

Data-Parallel Cumulative Aggregates: DistCumAgg Framework

- Basic idea: self-similar operator chain (forward, local, backward)
- [Figure: operator chain with block-local aggregates, aggregates of aggregates, and cumagg]

Data-Parallel Cumulative Aggregates: Basic Cumulative Aggregates

- Instantiating basic cumulative aggregates

    Operation    Init   f_agg         f_off                     f_cumagg
    cumsum(X)    0      colSums(B)    B_{1:} = B_{1:} + a       cumsum(B)
    cummin(X)    ∞      colMins(B)    B_{1:} = min(B_{1:}, a)   cummin(B)
    cummax(X)    -∞     colMaxs(B)    B_{1:} = max(B_{1:}, a)   cummax(B)
    cumprod(X)   1      colProds(B)   B_{1:} = B_{1:} * a       cumprod(B)

- [Figure: example cumsum(X); offset application and block-local cumulative aggregate are fused to avoid a copy]
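The forward, local, and backward steps can be sketched on a list of row blocks. This is a minimal single-process illustration of the idea, not the Spark implementation: the names `col_agg`, `cum_agg`, and `dist_cum_agg` are hypothetical, blocks are plain lists of rows, and the offset is folded into the block-local scan (corresponding to the fused f_off and f_cumagg step).

```python
import math
import operator
from functools import reduce


def col_agg(op, B):
    """f_agg: aggregate each column of block B (e.g., colSums for op = add)."""
    return [reduce(op, col) for col in zip(*B)]


def cum_agg(op, B, offset=None):
    """f_off + f_cumagg fused: cumulative aggregate over rows, seeded with an offset row."""
    out, prev = [], offset
    for row in B:
        prev = list(row) if prev is None else [op(p, x) for p, x in zip(prev, row)]
        out.append(prev)
    return out


def dist_cum_agg(blocks, op, init):
    # Forward: one aggregate row per block (data-parallel in a real system).
    aggs = [col_agg(op, B) for B in blocks]
    # Local: cumulative aggregate over the small matrix of block aggregates.
    cum = cum_agg(op, aggs)
    # Block i needs the aggregate of all preceding blocks; block 0 gets the init value.
    ncol = len(blocks[0][0])
    offsets = [[init] * ncol] + cum[:-1]
    # Backward: apply the offsets and scan each block independently.
    return [cum_agg(op, B, off) for B, off in zip(blocks, offsets)]


blocks = [[[1, 2], [1, 1]], [[3, 4], [2, 1]]]
print(dist_cum_agg(blocks, operator.add, 0))
# [[[1, 2], [2, 3]], [[5, 7], [7, 8]]]
print(dist_cum_agg(blocks, max, -math.inf))
# [[[1, 2], [1, 2]], [[3, 4], [3, 4]]]
```

Swapping the operator and init value according to the table above yields cummin, cummax, and cumprod without changing the chain.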

Data-Parallel Cumulative Aggregates: Complex Cumulative Aggregates

- Instantiating complex recurrence equations (e.g., exponential smoothing)
  - Z = cumsumprod(X) = cumsumprod(Y, W) with $Z_i = Y_i + W_i \cdot Z_{i-1}$, $Z_0 = 0$
  - Example input X = cbind(Y, W):

      1  .2
      1  .1
      3  .0
      2  .1

    Init   f_agg                                            f_off                               f_cumagg
    0      cbind(cumsumprod(B)_{n1}, prod(B_{:2}))          B_{11} = B_{11} + B_{12} * a        cumsumprod(B)

- [Figure: example cumsumprod(X)]
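The same chain carries over to the recurrence Z_i = Y_i + W_i * Z_{i-1}. The sketch below is an illustration under assumptions, not the SystemML operator: rows are (Y_i, W_i) pairs, the names `cumsumprod` and `dist_cumsumprod` are hypothetical, and passing the offset as the starting value `z0` is equivalent to updating the first row via B_11 = B_11 + B_12 * a.

```python
def cumsumprod(rows, z0=0):
    """Sequential recurrence Z_i = Y_i + W_i * Z_{i-1}, starting from z0."""
    out, z = [], z0
    for y, w in rows:
        z = y + w * z
        out.append(z)
    return out


def dist_cumsumprod(blocks):
    # Forward: per block, the last local result and the product of all weights.
    aggs = []
    for B in blocks:
        p = 1
        for _, w in B:
            p *= w
        aggs.append((cumsumprod(B)[-1], p))
    # Local: the block-end values follow the same recurrence over the aggregates.
    ends = cumsumprod(aggs)
    offsets = [0] + ends[:-1]
    # Backward: rerun each block, seeded with the value at the end of the previous block.
    return [cumsumprod(B, a) for B, a in zip(blocks, offsets)]


rows = [(1, 2), (3, 1), (2, 3), (1, 1), (4, 2)]
blocks = [rows[0:2], rows[2:4], rows[4:5]]
print(cumsumprod(rows))         # [1, 4, 14, 15, 34]
print(dist_cumsumprod(blocks))  # [[1, 4], [14, 15], [34]]
```

The blocked result, once concatenated, equals the sequential one, which is the self-similarity the framework relies on.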

System Integration

System Integration: Simplification Rewrites

- Example #1: Suffix sums
  - Problem: distributed reverse causes data shuffling
  - Compute via column aggregates and prefix sums:
    rev(cumsum(rev(X))) → X + colSums(X) - cumsum(X) (broadcast; partitioning-preserving)
- Example #2: Extract lower triangular
  - Problem: indexing is cumbersome/slow; cumsum is densifying
  - Use dedicated operators:
    X * cumsum(diag(matrix(1, nrow(X), 1))) → lower.tri(X)
  - Mask produced by cumsum(diag(...)) for 7 rows:

      1 0 0 0 0 0 0
      1 1 0 0 0 0 0
      1 1 1 0 0 0 0
      1 1 1 1 0 0 0
      1 1 1 1 1 0 0
      1 1 1 1 1 1 0
      1 1 1 1 1 1 1
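Both rewrites can be verified on small inputs. The following plain-Python sketch checks that the original and rewritten expressions agree; all helper names (`cumsum`, `rev`, `col_sums`, `identity`, and so on) are hypothetical stand-ins for the DML builtins, and matrices are lists of rows.

```python
def cumsum(X):
    out, prev = [], [0] * len(X[0])
    for row in X:
        prev = [p + x for p, x in zip(prev, row)]
        out.append(prev)
    return out


def rev(X):
    """Reverse the row order."""
    return X[::-1]


def col_sums(X):
    return [sum(col) for col in zip(*X)]


def suffix_sums_naive(X):
    return rev(cumsum(rev(X)))


def suffix_sums_rewritten(X):
    # X + colSums(X) - cumsum(X), with colSums(X) applied to every row.
    cs, Z = col_sums(X), cumsum(X)
    return [[x + s - z for x, s, z in zip(xr, cs, zr)] for xr, zr in zip(X, Z)]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def lower_tri_via_cumsum(X):
    mask = cumsum(identity(len(X)))  # lower-triangular matrix of ones
    return [[x * m for x, m in zip(xr, mr)] for xr, mr in zip(X, mask)]


def lower_tri(X):
    return [[x if j <= i else 0 for j, x in enumerate(row)] for i, row in enumerate(X)]


X = [[1, 2], [1, 1], [3, 4], [2, 1]]
print(suffix_sums_naive(X))      # [[7, 8], [6, 6], [5, 5], [2, 1]]
print(suffix_sums_rewritten(X))  # [[7, 8], [6, 6], [5, 5], [2, 1]]

S = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(lower_tri_via_cumsum(S))   # [[1, 0, 0], [4, 5, 0], [7, 8, 9]]
```

The rewritten suffix sum needs only a column aggregate and a prefix sum, so no row reversal is required.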

System Integration: Execution Plan Generation

- Compilation chain of cumulative aggregates
  - Execution type selection based on memory estimates
  - Physical operator config (broadcast, aggregation, in-place, #threads)
- Example runtime plan

    ...
    SP ucumack+ _mVar1 _mVar2
    CP ucumk+ _mVar2 _mVar3 24 T
    CP rmvar _mVar2
    SP bcumoffk+ _mVar1 _mVar3 _mVar4 0 T
    CP rmvar _mVar1 _mVar3
    ...

- [Figure: high-level operator (HOP) u(cumsum) over X and its low-level operators (LOPs); labels: cumagg (SP, k+), u(cumsum) (CP, 24, in-place), broadcast, #threads, nrow(X) ≥ b]
