
A Massively Parallel Dense Symmetric Eigensolver with Communication Splitting Multicasting Algorithm


  1. A Massively Parallel Dense Symmetric Eigensolver with Communication Splitting Multicasting Algorithm
Takahiro Katagiri (Information Technology Center, The University of Tokyo)
Shoji Itoh (Advanced Center for Computing and Communication, RIKEN; currently with the Information Technology Center, The University of Tokyo)
VECPAR'10, 9th International Meeting on High Performance Computing for Computational Science, CITRIS, UC Berkeley, CA, USA
June 23 (Wednesday), Session VI: Solvers on Emerging Architectures (Room 250), 18:00 – 18:25 (25 min.)

  2. Outline
• Background
• Communication Splitting Multicasting Algorithm for a Symmetric Dense Eigensolver
• Performance Evaluation
  ◦ T2K Open Supercomputer (U. Tokyo): AMD Opteron Quad Core (Barcelona)
  ◦ RICC PRIMERGY RX200S (RIKEN): Intel Xeon X5570 Quad Core (Nehalem)
• Conclusion

  3. BACKGROUND

  4. Issues in Establishing 100,000-Way Parallelism: the Need for a New "Design Space"
• 1. Load imbalance
  ◦ A big blocking length for data distribution and computation damages load balance in massively parallel processing (MPP).
  ◦ In ScaLAPACK, one "big" block size is used for both the BLAS operations and the data distribution.
  ◦ Example with block size 160 (minimum executable matrix size; worked out in the sketch below):
    - 10,000 cores: the minimum size is 16,000.
    - 100,000 cores: the minimum size is 50,596.
    - The whole matrix size is NOT small, and execution at these minimal sizes causes very heavy load imbalance.
• 2. Communication pattern and performance
  ◦ 1D data distribution: all cores take part in one collective operation, e.g. one MPI_ALLREDUCE over 10,000 cores (1 group).
  ◦ 2D data distribution: MPI_ALLREDUCE over 100 cores × 100 groups running simultaneously.
• 3. Communication-hiding implementation
  ◦ Overlap computation with non-blocking communication.
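
The "minimum executable matrix size" figures above can be reproduced with a small back-of-the-envelope calculation (a sketch assuming a square √p × √p process grid in which every core must own at least one full block per dimension; the grid assumption is ours, not stated on the slide):

```python
import math

def min_executable_size(block: int, cores: int) -> int:
    """Smallest matrix dimension for which every core owns one full block per dimension."""
    return round(block * math.sqrt(cores))

for cores in (10_000, 100_000):
    print(cores, "cores ->", min_executable_size(160, cores))
# 10000 cores -> 16000
# 100000 cores -> 50596

# Communication-pattern contrast from the same slide: a 2D grid turns one
# MPI_ALLREDUCE over all p cores into sqrt(p) simultaneous allreduces of sqrt(p) cores each.
p = 10_000
g = int(math.sqrt(p))
print("1D:", p, "cores x 1 group;  2D:", g, "cores x", g, "groups")
```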

  5. The Aim of This Study
• To establish an eigensolver algorithm for small-sized matrices on MPP.
  ◦ Conventional design space:
    - Small-scale parallelism: up to 1,000 cores.
    - "Ultra" large-scale execution on MPP: matrix dimensions of 100,000~1,000,000.
    - Such matrices are too big to run the solver in an actual supercomputer service.
• What is a "small size" for the target?
  ◦ The work area size per core matches the L1~L2 caches (a rough estimate follows below).
• What is MPP for the target?
  ◦ From 10,000 cores to 100,000 cores.
  ◦ Flat MPI model; hybrid MPI is also covered once the principal MPP algorithm is established.
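
As a rough illustration of "the work area per core matches the L1~L2 caches" (the 512 KiB cache figure below is our own assumption, not from the slides), the largest n for which the per-core share of an n × n double-precision matrix stays cache-resident is:

```python
import math

cache_bytes = 512 * 1024                           # assumed L2-class cache per core (illustrative)
for cores in (10_000, 100_000):
    n_max = math.isqrt(cache_bytes * cores // 8)   # 8-byte doubles, n^2 / cores elements per core
    print(f"{cores:>7} cores -> n up to about {n_max}")
# roughly 25,600 for 10,000 cores and 81,000 for 100,000 cores
```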

  6. Our Design Space for the Solver
• 1. Improvement of load imbalance
  ◦ Use a "non-blocking" (unblocked) algorithm: the data distribution block size can be permanently ONE, so no load imbalance caused by the data distribution occurs.
  ◦ Do not use symmetry: a simple computation kernel and high parallelism, at the cost of increased computational complexity.
• 2. Data distribution for MPP
  ◦ Use a 2D cyclic distribution: a (cyclic, cyclic) distribution with block size one gives perfect load balancing (see the ownership sketch below).
  ◦ Multicasting communication: reduces the communication time of MPI_BCAST and MPI_ALLREDUCE even when the number of cores or the vector size increases.
  ◦ Use duplication of pivot vectors: reduces gathering communication time.
• 3. Future work: communication hiding
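
A minimal sketch (our illustration, not the authors' code) of why a (cyclic, cyclic) distribution with block size one balances the load: element (i, j) is owned by process (i mod p_r, j mod p_c) on a p_r × p_c grid, so per-process element counts differ only by a partial row or column.

```python
from collections import Counter

def owner(i: int, j: int, pr: int, pc: int) -> tuple[int, int]:
    """Grid coordinates of the process owning A[i, j] under a (cyclic, cyclic) distribution."""
    return (i % pr, j % pc)

n, pr, pc = 10, 3, 3
counts = Counter(owner(i, j, pr, pc) for i in range(n) for j in range(n))
print(sorted(counts.values()))   # [9, 9, 9, 9, 12, 12, 12, 12, 16] -- nearly equal shares
```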

  7. AN EIGENSOLVER ALGORITHM FOR MASSIVELY PARALLEL COMPUTING

  8. The Eigenvalue Problem
• Standard eigenproblem: A x = λ x, where x is an eigenvector and λ is an eigenvalue.
• Application fields:
  ◦ Several science and technology problems.
  ◦ Quantum chemistry: A is dense and symmetric, and most of the eigenvalues and eigenvectors are required.
  ◦ Searching on the Internet (knowledge discovery) [M. Berry et al., 1995].
• Dense case: the computational complexity is O(n³), so a parallel implementation is needed (a serial baseline is sketched below).
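
For reference, a serial NumPy baseline for the standard symmetric eigenproblem (a stand-in to fix notation, not the parallel solver):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((500, 500))
A = (B + B.T) / 2                      # dense symmetric test matrix
lam, X = np.linalg.eigh(A)             # all eigenvalues and eigenvectors, O(n^3) work
assert np.allclose(A @ X, X * lam)     # A x_i = lambda_i x_i, column by column
```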

  9. A Classical Sequential Algorithm (Standard Eigenproblem A x = λ x, A a symmetric dense matrix)
1. Householder transformation (tridiagonalization): Q^T A Q = T, where T is a tridiagonal matrix and Q = H_1 H_2 … H_{n-2}.  Cost: O(n³).
2. Bisection on T: all eigenvalues Λ.
3. Inverse iteration on T: all eigenvectors Y.  Steps 2–3 cost O(n²)~O(n³) (MRRR: O(n²)).
4. Householder inverse transformation (back-transformation): all eigenvectors of the dense matrix A, X = Q Y.  Cost: O(n³).
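
A serial SciPy sketch of the same four-stage pipeline (a stand-in for clarity, not the parallel implementation): tridiagonalize, solve the tridiagonal problem, then back-transform the eigenvectors with X = Q Y.

```python
import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

rng = np.random.default_rng(1)
B = rng.standard_normal((300, 300))
A = (B + B.T) / 2                          # dense symmetric matrix

# 1) Householder tridiagonalization: for symmetric A the Hessenberg form is
#    tridiagonal, with Q^T A Q = T.
T, Q = hessenberg(A, calc_q=True)
d, e = np.diag(T), np.diag(T, -1)          # diagonal and subdiagonal of T

# 2) + 3) all eigenvalues and eigenvectors of the tridiagonal matrix T.
lam, Y = eigh_tridiagonal(d, e)

# 4) Householder back-transformation: eigenvectors of A are X = Q Y.
X = Q @ Y
assert np.allclose(A @ X, X * lam, atol=1e-8)
```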

  10. Basic Operations of Householder Tridiagonalization (Non-blocking Version)
Let A^{(k)} be the matrix at the k-th iteration. The operations in the k-th iteration are as follows (see the NumPy sketch below):
• u_k: the Householder reflection vector, generated from the column A^{(k)}_{k:n,k}
• H_k = I − u_k u_k^T: the Householder operator
• A^{(k+1)} = H_k A^{(k)} H_k: the Householder transformation

do k = 1, n−2
  ① y_k^T = u_k^T A^{(k)}_{k:n,k:n}   (matrix–vector multiplication)
  ② β_k = y_k^T u_k                   (dot product)
  ③ x_k = y_k                         (copy, when symmetric)
  ④ H_k A^{(k)} H_k = A^{(k)} − (x_k − β_k u_k) u_k^T − u_k y_k^T   (matrix update)
end do
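
A serial NumPy rendering of these four steps (a sketch for clarity, not the authors' distributed implementation; u_k is scaled so that u_k^T u_k = 2, which makes H_k orthogonal):

```python
import numpy as np

def householder_tridiagonalize(A):
    """Unblocked (block size one) Householder tridiagonalization.

    Per-iteration steps: (1) y = S u, (2) beta = y^T u, (3) x = y (symmetric case),
    (4) H S H = S - (x - beta*u) u^T - u y^T on the trailing block S.
    """
    A = np.array(A, dtype=float)                 # work on a copy
    n = A.shape[0]
    Q = np.eye(n)                                # accumulates Q = H_1 H_2 ... H_{n-2}
    for k in range(n - 2):
        a = A[k+1:, k].copy()                    # entries to be zeroed below the subdiagonal
        norm_a = np.linalg.norm(a)
        if norm_a == 0.0:
            continue                             # column already tridiagonal
        alpha = -np.copysign(norm_a, a[0])       # new subdiagonal entry
        v = a
        v[0] -= alpha
        u = v * np.sqrt(2.0 / (v @ v))           # Householder vector with u^T u = 2

        S = A[k+1:, k+1:]                        # trailing symmetric block (view into A)
        y = S @ u                                # (1) matrix-vector multiplication
        beta = y @ u                             # (2) dot product
        x = y                                    # (3) copy (exploits symmetry of S)
        S -= np.outer(x - beta * u, u) + np.outer(u, y)   # (4) rank-2 matrix update: H S H

        A[k+1, k] = A[k, k+1] = alpha            # tridiagonal entries of column/row k
        A[k+2:, k] = A[k, k+2:] = 0.0
        Q[:, k+1:] -= np.outer(Q[:, k+1:] @ u, u)   # fold H_k into Q for the back-transform
    return A, Q

# quick self-check: A = Q T Q^T and T is tridiagonal
rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A0 = (B + B.T) / 2
T, Q = householder_tridiagonalize(A0)
assert np.allclose(Q @ T @ Q.T, A0)
assert np.allclose(np.triu(T, 2), 0) and np.allclose(np.tril(T, -2), 0)
```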

  11. Communication Time Reduction for Householder Tridiagonalization
• 2D (cyclic, cyclic) distribution with multi-broadcasts: each pivot piece is broadcast only within its own process row or column (PE1–PE4 in the slide's figure), and the rows/columns communicate simultaneously.
• Merit: perfect load balancing, and a reduced communication volume, from O(n² log p) to O((n²/√p) log p), where p is the number of processes and n is the problem size.
• Drawback: it increases the number of communications.
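
The splitting idea can be illustrated with mpi4py (a minimal sketch under our own naming, not the authors' code): the global communicator is split into per-row and per-column sub-communicators of √p processes, so each collective runs over only √p cores, with √p groups active at once instead of one group of p.

```python
# Run with a square process count, e.g.:  mpirun -np 16 python split_collectives.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()
q = int(round(p ** 0.5))                          # q x q process grid
assert q * q == p, "run with a square number of processes"

my_row, my_col = divmod(rank, q)
row_comm = comm.Split(color=my_row, key=my_col)   # q processes per row
col_comm = comm.Split(color=my_col, key=my_row)   # q processes per column

# One global MPI_Allreduce over p cores becomes q independent allreduces,
# each over only q cores, running at the same time (one per process row).
local = np.full(4, float(rank))
row_sum = np.empty_like(local)
row_comm.Allreduce(local, row_sum, op=MPI.SUM)

# The pivot vector held by the diagonal process of each row is then
# multicast only within that row (and analogously within each column).
pivot = np.empty(4)
if my_col == my_row:
    pivot[:] = row_sum
row_comm.Bcast(pivot, root=my_row)                # diagonal process's rank inside row_comm
col_comm.Free(); row_comm.Free()
```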

  12. An Example: HITACHI SR2201 (Yr. 2000), Hessenberg Reduction, n = 4096
[Figure: execution time in seconds (log scale, roughly 10–1000 s) versus number of processes (4–256) for the 1D distributions (*, Block) and (*, Cyclic) and the 2D distributions (Block, Block) and (Cyclic, Cyclic).]

  13. Effect of the 2D Distribution in Our Method (HITACHI SR2201 (Yr. 2000), Householder Tridiagonalization)
[Figure: execution time in seconds (log scale) versus problem size for our method and ScaLAPACK on 4, 64, 128, and 512 processes; annotated speedups over ScaLAPACK of 5.4x, 3.6x, 4.6x, and 5.7x on the respective panels, with annotated times of 325 seconds and 81 seconds on the larger process counts.]

  14. Whole Parallel Processes of the Eigensolver
• Tridiagonalization A → T, then gather all elements of T.
• Compute upper and lower limits for the eigenvalues; obtain all eigenvalues Λ (1, 2, 3, 4, … in rising order).
• Compute the eigenvectors Y, corresponding to the rising order of the eigenvalues, then gather Y.
(A per-process splitting of the eigenpair work is sketched below.)
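
After tridiagonalization, the eigenpair work splits naturally by eigenvalue index. A serial sketch of that splitting (SciPy's tridiagonal solver as a stand-in for the bisection / inverse-iteration kernels; the MPI gathers are only mimicked by concatenation):

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

n, procs = 12, 4
rng = np.random.default_rng(0)
d = rng.standard_normal(n)               # diagonal of T
e = rng.standard_normal(n - 1)           # off-diagonal of T

pieces = []
for rank in range(procs):                # stand-in for independent MPI ranks
    lo = rank * n // procs               # this rank's slice of eigenvalue indices
    hi = (rank + 1) * n // procs - 1
    w, y = eigh_tridiagonal(d, e, select='i', select_range=(lo, hi))
    pieces.append((w, y))

lam = np.concatenate([w for w, _ in pieces])    # "gather" all eigenvalues (rising order)
Y = np.hstack([y for _, y in pieces])           # "gather" the matching eigenvectors
T = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
assert np.allclose(T @ Y, Y * lam)
```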

  15. Data Duplication for the Tridiagonalization
[Figure: the matrix A is distributed over a grid of p × q processes; the vectors u_k and x_k are duplicated along one grid dimension (p processes) and the vectors y_k along the other (q processes).]
