MadLINQ: Large-Scale Distributed Matrix Computation for the Cloud




SLIDE 1

MadLINQ: Large-Scale Distributed Matrix Computation for the Cloud

Zhengping Qian, Xiuwei Chen, Nanxi Kang, Mingcheng Chen, Yuan Yu, Thomas Moscibroda, Zheng Zhang Philip Leonard October 27, 2014

Philip Leonard (University of Cambridge) MadLINQ October 27, 2014 1 / 17

SLIDE 2

Motivation

MapReduce's and DryadLINQ's relational algebra operators are not well suited to linear algebra computations. Demand for efficient matrix computation comes from:

◮ Machine learning
◮ Graph algorithms (graphs boil down to sparse matrices)

Previous attempts failed to deliver:

◮ ScaLAPACK [2] is too low level (MPI knowledge required)
◮ HAMA is built on top of MapReduce (still restrictive)

SLIDE 3

Key Components of MadLINQ

◮ Simple programming model for matrix computation
◮ New Fine-Grained Pipelining (FGP) model
◮ Fault tolerance for FGP
◮ Integration with DryadLINQ [3]


SLIDE 4

Tile Algorithms

A tile is a sub-matrix; the entire matrix is partitioned into a grid of tiles. This idea is what gives rise to parallelism in matrix computation. The aim is to maximise cache locality by exploiting the structured access patterns of matrix algorithms.
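The partitioning step can be sketched in a few lines of Python/NumPy (a hypothetical `to_tiles` helper for illustration only; MadLINQ itself is a C# system and its tile representation is not shown on the slide):

```python
import numpy as np

def to_tiles(A, b):
    """Partition matrix A into a grid of b-by-b tiles.
    Assumes b divides both dimensions of A exactly."""
    n_rows, n_cols = A.shape
    return [[A[i:i + b, j:j + b] for j in range(0, n_cols, b)]
            for i in range(0, n_rows, b)]

A = np.arange(16, dtype=float).reshape(4, 4)
tiles = to_tiles(A, 2)   # a 2x2 grid of 2x2 tiles
```

Each tile is a view into the original matrix, so tile operations can work on contiguous sub-blocks without copying.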


SLIDE 5

Computation Example: Cholesky Decomposition

Takes a symmetric positive-definite matrix, partitioned into tiles. On the k-th iteration, tile operations are employed to factorise:

◮ the diagonal tile (DPOTRF)
◮ the n − k tiles below it (DTRSM)
◮ the trailing tiles to the right (DSYRK and DGEMM)
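The per-iteration structure above can be sketched as a sequential Python/NumPy version (NumPy calls stand in for the LAPACK kernels DPOTRF/DTRSM/DSYRK/DGEMM; in MadLINQ each tile operation would instead be a distributed vertex):

```python
import numpy as np

def tiled_cholesky(A, b):
    """Right-looking tiled Cholesky: overwrite A's lower triangle with L
    such that L @ L.T == A. Sequential sketch of the tile algorithm."""
    n = A.shape[0] // b                       # number of tile rows/columns
    T = lambda i, j: A[i*b:(i+1)*b, j*b:(j+1)*b]   # view of tile (i, j)
    for k in range(n):
        # DPOTRF: factorise the diagonal tile
        T(k, k)[:] = np.linalg.cholesky(T(k, k))
        for i in range(k + 1, n):
            # DTRSM: triangular solve for each of the n - k tiles below
            T(i, k)[:] = np.linalg.solve(T(k, k), T(i, k).T).T
        for j in range(k + 1, n):
            for i in range(j, n):
                # DSYRK (i == j) / DGEMM (i != j): update trailing tiles
                T(i, j)[:] -= T(i, k) @ T(j, k).T
    return np.tril(A)

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
S = M @ M.T + 4.0 * np.eye(4)     # symmetric positive-definite test matrix
L = tiled_cholesky(S.copy(), 2)   # two tile rows/columns of size 2
```

Note how each trailing-update tile depends only on the tiles in column k: this is exactly the dependency structure the DAG on a later slide captures.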

SLIDE 6

Cholesky Decomposition Iteration


SLIDE 7

Programming Model

◮ C# constructs allow DryadLINQ and MadLINQ integration
◮ A Matrix data abstraction in C# encapsulates the tile representation
◮ Programs are expressed in a sequential fashion
◮ Linear algebra library in C#


SLIDE 8

Example Application: Collaborative Filtering

How to predict what other movies users will like, given the ratings they have already made.

R[i, j] is user j’s rating of movie i.

CF Equation

(R · Rᵀ) · R becomes:

CF MadLINQ Code

Matrix similarity = R.Multiply(R.Transpose());
Matrix scores = similarity.Multiply(R).Normalize();

* The matrix goes from sparse (users haven't seen most movies) to dense (every user gets a predicted score for every movie)
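The same computation in a toy NumPy sketch (the ratings matrix is invented, and since the slide does not specify Normalize()'s semantics, a per-user column normalisation is assumed here):

```python
import numpy as np

# Toy ratings: R[i, j] is user j's rating of movie i (0 = unseen).
R = np.array([[5., 0., 3.],
              [4., 2., 0.],
              [0., 5., 4.]])

similarity = R @ R.T                    # movie-movie co-rating similarity
scores = similarity @ R                 # every entry non-zero: sparse -> dense
scores = scores / scores.sum(axis=0)    # assumed per-user (column) normalisation
```

Even on this tiny example, every zero in R becomes a non-zero predicted score, which is the sparse-to-dense transition the slide notes.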

SLIDE 9

CF: Integration with DryadLINQ

◮ DryadLINQ processes the Netflix dataset into a MadLINQ Matrix
◮ MadLINQ performs the transposition, matrix multiplication and normalisation of R to get the scores
◮ DryadLINQ generates a top-5 list of movies for each user


SLIDE 10

MadLINQ Architecture


SLIDE 11

Directed Acyclic Graph (DAG)

The DAG is dynamically expanded through symbolic execution to prevent state explosion (O(n³) vertices for Cholesky decomposition). f1 through f4 are the tile operators discussed earlier (DPOTRF, DTRSM, DSYRK and DGEMM respectively).
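The on-demand expansion idea can be illustrated with a generator that enumerates Cholesky's tile operations lazily (a Python sketch; MadLINQ's actual symbolic execution of the C# program is more general than this hand-written enumeration):

```python
from itertools import islice

def cholesky_tasks(n):
    """Lazily enumerate the tile operations of an n-by-n tiled Cholesky.
    Yields (operator, output_tile, input_tiles) triples; nothing is
    materialised until asked for, mirroring on-demand DAG expansion."""
    for k in range(n):
        yield ("POTRF", (k, k), [(k, k)])
        for i in range(k + 1, n):
            yield ("TRSM", (i, k), [(k, k), (i, k)])
        for j in range(k + 1, n):
            for i in range(j, n):
                op = "SYRK" if i == j else "GEMM"
                yield (op, (i, j), [(i, k), (j, k), (i, j)])

# Even for large n, only the requested prefix of the O(n^3) DAG is built:
head = list(islice(cholesky_tasks(1000), 4))
```

The input-tile lists encode the edges of the DAG: a scheduler can dispatch a task once all its input tiles have been produced.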


SLIDE 12

FGP & Fault Tolerance

◮ Parallelism fluctuates during matrix computations
◮ Pipelining exploits vertex parallelism by increasing data granularity (recursively tiling matrices)
◮ Failure handling: input blocks can be reconstructed from output blocks; dependencies are calculated to reduce recovery cost


SLIDE 13

Optimisations & Configuration

From the authors' experience, several optimisations were made:

◮ Prefetching of vertex data for a node that is close to terminating
◮ Specifying the order of matrix data (column- or row-first?)
◮ Auto-switching between sparse (compressed) and dense matrices

Configuration:

◮ Smaller tiles ⇒ higher parallelism
◮ The granularity of computation is a block
◮ Block size is determined by the number of non-zero elements
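A minimal sketch of such a density-based representation switch (the threshold and the COO-style triple are assumptions for illustration; the slide does not give MadLINQ's actual format or cut-off):

```python
import numpy as np

DENSITY_THRESHOLD = 0.25   # assumed cut-off; the real value is not on the slide

def choose_representation(tile):
    """Store a tile densely when enough entries are non-zero,
    otherwise as a compressed (rows, cols, values) triple."""
    density = np.count_nonzero(tile) / tile.size
    if density >= DENSITY_THRESHOLD:
        return ("dense", tile)
    rows, cols = np.nonzero(tile)
    return ("sparse", (rows, cols, tile[rows, cols]))

almost_empty = np.zeros((4, 4))
almost_empty[0, 0] = 1.0
kind, _ = choose_representation(almost_empty)   # -> "sparse"
```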

SLIDE 14

Observations & Applications

Observations:

◮ Pipelining performs better on larger problems
◮ The pipelined approach is on average 14.4% faster than ScaLAPACK
◮ ScaLAPACK failed consistently at 32 cores (it has no fault tolerance)

Real-world applications:

◮ MadLINQ is more efficient than MapReduce
◮ For collaborative filtering (recall (R · Rᵀ) · R) on a 20k × 500k matrix (the Netflix challenge), Mahout over Hadoop took over 800 minutes, as opposed to MadLINQ's 16 (albeit Mahout produces results of higher accuracy)


SLIDE 15

Conclusion & Related Work

DAGuE makes a similar use of DAGs for tiled algorithms, but provides neither fault tolerance nor resource dynamics. Future research ideas:

◮ Auto-tiling of matrices for matrix algorithms
◮ Dynamic re-tiling (changing tile sizes on the fly, e.g. for graph algorithms)
◮ Sparse matrices cause load imbalance; methods are required for handling these well

The paper concludes that MadLINQ fills the void of large-scale distributed matrix and graph processing using linear algebra.


SLIDE 16

References I

[1] Z. Qian, X. Chen, N. Kang, M. Chen, Y. Yu, T. Moscibroda, Z. Zhang. MadLINQ: Large-Scale Distributed Matrix Computation for the Cloud. In EuroSys, 2012.

[2] J. Choi, J. Dongarra, R. Pozo, D. Walker. ScaLAPACK: A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers. In Symposium on the Frontiers of Massively Parallel Computation, 1992.

[3] Y. Yu, M. Isard, D. Fetterly, et al. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. In OSDI, 2008.


SLIDE 17

Questions
