
A Massively Parallel Dense Symmetric Eigensolver with Communication Splitting Multicasting Algorithm


  1. A Massively Parallel Dense Symmetric Eigensolver with Communication Splitting Multicasting Algorithm
Takahiro Katagiri (Information Technology Center, The University of Tokyo)
Shoji Itoh (Advanced Center for Computing and Communication, RIKEN; currently with the Information Technology Center, The University of Tokyo)
VECPAR'10, 9th International Meeting on High Performance Computing for Computational Science, CITRIS, UC Berkeley, CA, USA
June 23 (Wednesday), Session VI: Solvers on Emerging Architectures (Room 250), 18:00 – 18:25 (25 min.)

  2. Outline
• Background
• Communication Splitting Multicasting Algorithm for a Symmetric Dense Eigensolver
• Performance Evaluation
  ◦ T2K Open Supercomputer (U. Tokyo): AMD Opteron Quad Core (Barcelona)
  ◦ RICC PRIMERGY RX200S (RIKEN): Intel Xeon X5570 Quad Core (Nehalem)
• Conclusion

  3. BACKGROUND

  4. Issues in Establishing 100,000-Way Parallelism: the Need for a New "Design Space"
• 1. Load imbalance
  ◦ A big blocking length for data distribution and computation damages load balance in massively parallel processing (MPP).
  ◦ In ScaLAPACK, one "big" block size is used for both the BLAS operations and the data distribution.
  ◦ Example with block size 160 (minimum executable matrix size; worked out in the sketch below):
    - 10,000 cores: the minimum size is 16,000.
    - 100,000 cores: the minimum size is 50,596.
    - The whole matrix size is NOT small, and execution at these minimal sizes causes very heavy load imbalance.
• 2. Communication pattern and performance
  ◦ 1D data distribution: all cores take part in one collective operation, e.g. one MPI_ALLREDUCE over 10,000 cores (1 group).
  ◦ 2D data distribution: MPI_ALLREDUCE over 100 cores × 100 groups running simultaneously.
• 3. Communication-hiding implementation
  ◦ Overlap computation with non-blocking communication.
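
The "minimum executable matrix size" figures above can be reproduced with a small back-of-the-envelope calculation (a sketch assuming a square √p × √p process grid in which every core must own at least one full block per dimension; the grid assumption is ours, not stated on the slide):

```python
import math

def min_executable_size(block: int, cores: int) -> int:
    """Smallest matrix dimension for which every core owns one full block per dimension."""
    return round(block * math.sqrt(cores))

for cores in (10_000, 100_000):
    print(cores, "cores ->", min_executable_size(160, cores))
# 10000 cores -> 16000
# 100000 cores -> 50596

# Communication-pattern contrast from the same slide: a 2D grid turns one
# MPI_ALLREDUCE over all p cores into sqrt(p) simultaneous allreduces of sqrt(p) cores each.
p = 10_000
g = int(math.sqrt(p))
print("1D:", p, "cores x 1 group;  2D:", g, "cores x", g, "groups")
```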

  5. The Aim of This Study
• To establish an eigensolver algorithm for small-sized matrices on MPP.
  ◦ Conventional design space:
    - Small-scale parallelism: up to 1,000 cores.
    - "Ultra" large-scale execution on MPP: matrix dimensions of 100,000~1,000,000.
    - Such matrices are too big to run the solver in an actual supercomputer service.
• What is a "small size" for the target?
  ◦ The work area size per core matches the L1~L2 caches (a rough estimate follows below).
• What is MPP for the target?
  ◦ From 10,000 cores to 100,000 cores.
  ◦ Flat MPI model; hybrid MPI is also covered once the principal MPP algorithm is established.
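
As a rough illustration of "the work area per core matches the L1~L2 caches" (the 512 KiB cache figure below is our own assumption, not from the slides), the largest n for which the per-core share of an n × n double-precision matrix stays cache-resident is:

```python
import math

cache_bytes = 512 * 1024                           # assumed L2-class cache per core (illustrative)
for cores in (10_000, 100_000):
    n_max = math.isqrt(cache_bytes * cores // 8)   # 8-byte doubles, n^2 / cores elements per core
    print(f"{cores:>7} cores -> n up to about {n_max}")
# roughly 25,600 for 10,000 cores and 81,000 for 100,000 cores
```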

  6. Our Design Space for the Solver
• 1. Improvement of load imbalance
  ◦ Use a "non-blocking" (unblocked) algorithm: the data distribution block size can be permanently ONE, so no load imbalance caused by the data distribution occurs.
  ◦ Do not use symmetry: a simple computation kernel and high parallelism, at the cost of increased computational complexity.
• 2. Data distribution for MPP
  ◦ Use a 2D cyclic distribution: a (cyclic, cyclic) distribution with block size one gives perfect load balancing (see the ownership sketch below).
  ◦ Multicasting communication: reduces the communication time of MPI_BCAST and MPI_ALLREDUCE even when the number of cores or the vector size increases.
  ◦ Use duplication of pivot vectors: reduces gathering communication time.
• 3. Future work: communication hiding
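
A minimal sketch (our illustration, not the authors' code) of why a (cyclic, cyclic) distribution with block size one balances the load: element (i, j) is owned by process (i mod p_r, j mod p_c) on a p_r × p_c grid, so per-process element counts differ only by a partial row or column.

```python
from collections import Counter

def owner(i: int, j: int, pr: int, pc: int) -> tuple[int, int]:
    """Grid coordinates of the process owning A[i, j] under a (cyclic, cyclic) distribution."""
    return (i % pr, j % pc)

n, pr, pc = 10, 3, 3
counts = Counter(owner(i, j, pr, pc) for i in range(n) for j in range(n))
print(sorted(counts.values()))   # [9, 9, 9, 9, 12, 12, 12, 12, 16] -- nearly equal shares
```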

  7. AN EIGENSOLVER ALGORITHM FOR MASSIVELY PARALLEL COMPUTING

  8. The Eigenvalue Problem
• Standard eigenproblem: A x = λ x, where x is an eigenvector and λ is an eigenvalue.
• Application fields:
  ◦ Several science and technology problems.
  ◦ Quantum chemistry: A is dense and symmetric, and most of the eigenvalues and eigenvectors are required.
  ◦ Searching on the Internet (knowledge discovery) [M. Berry et al., 1995].
• Dense case: the computational complexity is O(n³), so a parallel implementation is needed (a serial baseline is sketched below).
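
For reference, a serial NumPy baseline for the standard symmetric eigenproblem (a stand-in to fix notation, not the parallel solver):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((500, 500))
A = (B + B.T) / 2                      # dense symmetric test matrix
lam, X = np.linalg.eigh(A)             # all eigenvalues and eigenvectors, O(n^3) work
assert np.allclose(A @ X, X * lam)     # A x_i = lambda_i x_i, column by column
```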

  9. A Classical Sequential Algorithm (Standard Eigenproblem A x = λ x, A a symmetric dense matrix)
1. Householder transformation (tridiagonalization): Q^T A Q = T, where T is a tridiagonal matrix and Q = H_1 H_2 … H_{n-2}.  Cost: O(n³).
2. Bisection on T: all eigenvalues Λ.
3. Inverse iteration on T: all eigenvectors Y.  Steps 2–3 cost O(n²)~O(n³) (MRRR: O(n²)).
4. Householder inverse transformation (back-transformation): all eigenvectors of the dense matrix A, X = Q Y.  Cost: O(n³).
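
A serial SciPy sketch of the same four-stage pipeline (a stand-in for clarity, not the parallel implementation): tridiagonalize, solve the tridiagonal problem, then back-transform the eigenvectors with X = Q Y.

```python
import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

rng = np.random.default_rng(1)
B = rng.standard_normal((300, 300))
A = (B + B.T) / 2                          # dense symmetric matrix

# 1) Householder tridiagonalization: for symmetric A the Hessenberg form is
#    tridiagonal, with Q^T A Q = T.
T, Q = hessenberg(A, calc_q=True)
d, e = np.diag(T), np.diag(T, -1)          # diagonal and subdiagonal of T

# 2) + 3) all eigenvalues and eigenvectors of the tridiagonal matrix T.
lam, Y = eigh_tridiagonal(d, e)

# 4) Householder back-transformation: eigenvectors of A are X = Q Y.
X = Q @ Y
assert np.allclose(A @ X, X * lam, atol=1e-8)
```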

  10. Basic Operations of Householder Tridiagonalization (Non-blocking Version)
Let A^{(k)} be the matrix at the k-th iteration. The operations in the k-th iteration are as follows (see the NumPy sketch below):
• u_k: the Householder reflection vector, generated from the column A^{(k)}_{k:n,k}
• H_k = I − u_k u_k^T: the Householder operator
• A^{(k+1)} = H_k A^{(k)} H_k: the Householder transformation

do k = 1, n−2
  ① y_k^T = u_k^T A^{(k)}_{k:n,k:n}   (matrix–vector multiplication)
  ② β_k = y_k^T u_k                   (dot product)
  ③ x_k = y_k                         (copy, when symmetric)
  ④ H_k A^{(k)} H_k = A^{(k)} − (x_k − β_k u_k) u_k^T − u_k y_k^T   (matrix update)
end do
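
A serial NumPy rendering of these four steps (a sketch for clarity, not the authors' distributed implementation; u_k is scaled so that u_k^T u_k = 2, which makes H_k orthogonal):

```python
import numpy as np

def householder_tridiagonalize(A):
    """Unblocked (block size one) Householder tridiagonalization.

    Per-iteration steps: (1) y = S u, (2) beta = y^T u, (3) x = y (symmetric case),
    (4) H S H = S - (x - beta*u) u^T - u y^T on the trailing block S.
    """
    A = np.array(A, dtype=float)                 # work on a copy
    n = A.shape[0]
    Q = np.eye(n)                                # accumulates Q = H_1 H_2 ... H_{n-2}
    for k in range(n - 2):
        a = A[k+1:, k].copy()                    # entries to be zeroed below the subdiagonal
        norm_a = np.linalg.norm(a)
        if norm_a == 0.0:
            continue                             # column already tridiagonal
        alpha = -np.copysign(norm_a, a[0])       # new subdiagonal entry
        v = a
        v[0] -= alpha
        u = v * np.sqrt(2.0 / (v @ v))           # Householder vector with u^T u = 2

        S = A[k+1:, k+1:]                        # trailing symmetric block (view into A)
        y = S @ u                                # (1) matrix-vector multiplication
        beta = y @ u                             # (2) dot product
        x = y                                    # (3) copy (exploits symmetry of S)
        S -= np.outer(x - beta * u, u) + np.outer(u, y)   # (4) rank-2 matrix update: H S H

        A[k+1, k] = A[k, k+1] = alpha            # tridiagonal entries of column/row k
        A[k+2:, k] = A[k, k+2:] = 0.0
        Q[:, k+1:] -= np.outer(Q[:, k+1:] @ u, u)   # fold H_k into Q for the back-transform
    return A, Q

# quick self-check: A = Q T Q^T and T is tridiagonal
rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A0 = (B + B.T) / 2
T, Q = householder_tridiagonalize(A0)
assert np.allclose(Q @ T @ Q.T, A0)
assert np.allclose(np.triu(T, 2), 0) and np.allclose(np.tril(T, -2), 0)
```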

  11. Communication Time Reduction for Householder Tridiagonalization
• 2D (cyclic, cyclic) distribution with multi-broadcasts: each pivot piece is broadcast only within its own process row or column (PE1–PE4 in the slide's figure), and the rows/columns communicate simultaneously.
• Merit: perfect load balancing, and a reduced communication volume, from O(n² log p) to O((n²/√p) log p), where p is the number of processes and n is the problem size.
• Drawback: it increases the number of communications.
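
The splitting idea can be illustrated with mpi4py (a minimal sketch under our own naming, not the authors' code): the global communicator is split into per-row and per-column sub-communicators of √p processes, so each collective runs over only √p cores, with √p groups active at once instead of one group of p.

```python
# Run with a square process count, e.g.:  mpirun -np 16 python split_collectives.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()
q = int(round(p ** 0.5))                          # q x q process grid
assert q * q == p, "run with a square number of processes"

my_row, my_col = divmod(rank, q)
row_comm = comm.Split(color=my_row, key=my_col)   # q processes per row
col_comm = comm.Split(color=my_col, key=my_row)   # q processes per column

# One global MPI_Allreduce over p cores becomes q independent allreduces,
# each over only q cores, running at the same time (one per process row).
local = np.full(4, float(rank))
row_sum = np.empty_like(local)
row_comm.Allreduce(local, row_sum, op=MPI.SUM)

# The pivot vector held by the diagonal process of each row is then
# multicast only within that row (and analogously within each column).
pivot = np.empty(4)
if my_col == my_row:
    pivot[:] = row_sum
row_comm.Bcast(pivot, root=my_row)                # diagonal process's rank inside row_comm
col_comm.Free(); row_comm.Free()
```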

  12. An Example: HITACHI SR2201 (Yr. 2000), Hessenberg Reduction, n = 4096
[Figure: execution time in seconds (log scale, roughly 10–1000 s) versus number of processes (4–256) for the 1D distributions (*, Block) and (*, Cyclic) and the 2D distributions (Block, Block) and (Cyclic, Cyclic).]

  13. Effect of the 2D Distribution in Our Method (HITACHI SR2201 (Yr. 2000), Householder Tridiagonalization)
[Figure: execution time in seconds (log scale) versus problem size for our method and ScaLAPACK on 4, 64, 128, and 512 processes; annotated speedups over ScaLAPACK of 5.4x, 3.6x, 4.6x, and 5.7x on the respective panels, with annotated times of 325 seconds and 81 seconds on the larger process counts.]

  14. Whole Parallel Processes of the Eigensolver
• Tridiagonalization A → T, then gather all elements of T.
• Compute upper and lower limits for the eigenvalues; obtain all eigenvalues Λ (1, 2, 3, 4, … in rising order).
• Compute the eigenvectors Y, corresponding to the rising order of the eigenvalues, then gather Y.
(A per-process splitting of the eigenpair work is sketched below.)
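
After tridiagonalization, the eigenpair work splits naturally by eigenvalue index. A serial sketch of that splitting (SciPy's tridiagonal solver as a stand-in for the bisection / inverse-iteration kernels; the MPI gathers are only mimicked by concatenation):

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

n, procs = 12, 4
rng = np.random.default_rng(0)
d = rng.standard_normal(n)               # diagonal of T
e = rng.standard_normal(n - 1)           # off-diagonal of T

pieces = []
for rank in range(procs):                # stand-in for independent MPI ranks
    lo = rank * n // procs               # this rank's slice of eigenvalue indices
    hi = (rank + 1) * n // procs - 1
    w, y = eigh_tridiagonal(d, e, select='i', select_range=(lo, hi))
    pieces.append((w, y))

lam = np.concatenate([w for w, _ in pieces])    # "gather" all eigenvalues (rising order)
Y = np.hstack([y for _, y in pieces])           # "gather" the matching eigenvectors
T = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
assert np.allclose(T @ Y, Y * lam)
```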

  15. Data Duplication for the Tridiagonalization
[Figure: the matrix A is distributed over a grid of p × q processes; the vectors u_k and x_k are duplicated along one grid dimension (p processes) and the vectors y_k along the other (q processes).]
