
High Performance Matrix Inversion Based on LU Factorization for Multicore Architectures

  1. Jack Dongarra, Mathieu Faverge, Hatem Ltaief, Piotr Luszczek: High Performance Matrix Inversion Based on LU Factorization for Multicore Architectures (presented by Piotr Luszczek)

  2. Preliminaries

  3. Problem Statement: given $A \in \mathbb{R}^{n \times n}$, factor $PA = LU$, compute $U \to U^{-1}$ and $L \to L^{-1}$, and obtain $A^{-1} \in \mathbb{R}^{n \times n}$.
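The steps on this slide can be tied together with a short derivation. This is a sketch using standard linear algebra rather than text recovered from the slide; it assumes $P$ is the row permutation matrix produced by partial pivoting, so that $P^{-1} = P^{T}$.

```latex
\begin{aligned}
PA &= LU \\
A &= P^{T} L U \\
A^{-1} &= U^{-1} L^{-1} \left(P^{T}\right)^{-1} = U^{-1} L^{-1} P
\end{aligned}
```

Multiplying by $P$ on the right permutes columns, so the last step amounts to column interchanges applied to the product $U^{-1} L^{-1}$.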

  4. To Keep in Mind... "In the vast majority of practical computational problems, it is unnecessary and inadvisable to actually compute $A^{-1}$." (Forsythe, Malcolm, and Moler)
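As a minimal sketch of the advice above, a linear system can be solved directly by Gaussian elimination with partial pivoting, without ever forming the inverse. The function name `solve` and the 2 × 2 system are illustrative assumptions, not taken from the slides.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    `a` is a list of rows, `b` a list of right-hand-side values.
    The inverse of `a` is never formed.
    """
    n = len(a)
    # Work on an augmented copy [a | b] so the inputs are left untouched.
    m = [[float(v) for v in row] + [float(bi)] for row, bi in zip(a, b)]
    for k in range(n):
        # Pick the row with the largest entry in column k as the pivot row.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]
        # Eliminate column k below the pivot.
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    # Back substitution on the resulting upper triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


# Illustrative system: 2x + y = 3, x + 3y = 5.
x = solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
```

Solving this way needs one factorization and one pair of triangular sweeps per right-hand side, whereas explicit inversion adds the triangular inversions and a further multiplication.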

  5. Data Layouts for Matrix Elements: column-major (LAPACK and derivatives) and tile (PLASMA)
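The difference between the two layouts can be sketched as a mapping from element coordinates to memory offsets. The function names are hypothetical, and the tile variant assumes square `nb` × `nb` tiles, a matrix order divisible by `nb`, and column-major ordering both of the tiles and of the elements inside each tile; the slide does not spell out these details.

```python
def column_major_index(i, j, m):
    """Offset of element (i, j) in an m-row matrix stored column by column."""
    return i + j * m


def tile_index(i, j, m, nb):
    """Offset of element (i, j) when each nb x nb tile is stored contiguously.

    Assumptions (not from the slides): m is divisible by nb, tiles are
    ordered column-major, and elements inside a tile are column-major.
    """
    mt = m // nb                  # number of tile rows
    ti, tj = i // nb, j // nb     # which tile holds the element
    ii, jj = i % nb, j % nb       # position inside that tile
    tile_number = ti + tj * mt
    return tile_number * nb * nb + ii + jj * nb


# Offsets of the four elements of tile (1, 0) in a 4 x 4 matrix with nb = 2.
tile_offsets = sorted(tile_index(i, j, 4, 2) for i in (2, 3) for j in (0, 1))
column_offsets = sorted(column_major_index(i, j, 4) for i in (2, 3) for j in (0, 1))
```

Under these assumptions the elements of one tile occupy consecutive offsets in the tile layout, while the same elements are split across two columns in the column-major layout.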

  6. Tasks and DAGs

  7. Block LU Inversion vs. Tile LU Inversion

     Block LU Inversion:
     - For each panel, LU factorization: DGETF2( ), DLASWP( ), DLASWP( ), DTRSM( ), DGEMM( )
     - For each panel, invert U: DTRMM( ), DTRSM( ), DTRTI2( )
     - For each panel, invert L: DLACPY( ), DLASET( ), DGEMM( ), DTRSM( )
     - DLASWP( ): column interchanges

     Tile LU Inversion:
     - For each diagonal tile: DGETRFR( ) (parallel recursive LU); for each tail tile panel, DLASWP( ); for each tail tile, DGEMM( ); for each left tile panel, DLASWP( )
     - For each diagonal tile, invert U: for each tile in panel, DTRSM( ); for each tail tile, DGEMM( ); for each left panel tile, DTRSM( ); DTRTRI( )
     - For each left tile, invert L: DLACPY( ), DLASET( ), ...
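The four phases named on this slide (LU factorization, inversion of U, inversion of L, column interchanges) can be sketched in an unblocked, sequential form. This is only an illustration of the arithmetic under the assumption that the final result is formed as $U^{-1} L^{-1} P$; it is not the blocked LAPACK routine or the tiled PLASMA algorithm, and the function name `lu_inverse` is hypothetical.

```python
def lu_inverse(a):
    """Invert a square matrix (list of rows) via PA = LU. Unblocked sketch."""
    n = len(a)
    lu = [[float(v) for v in row] for row in a]
    perm = list(range(n))  # perm[k] = original index of the row now in position k

    # Phase 1: LU factorization with partial pivoting, L and U stored in place.
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]

    # Phase 2: invert the upper triangular factor U, one column at a time.
    u_inv = [[0.0] * n for _ in range(n)]
    for j in range(n):
        u_inv[j][j] = 1.0 / lu[j][j]
        for i in range(j - 1, -1, -1):
            s = sum(lu[i][k] * u_inv[k][j] for k in range(i + 1, j + 1))
            u_inv[i][j] = -s / lu[i][i]

    # Phase 3: invert the unit lower triangular factor L.
    l_inv = [[0.0] * n for _ in range(n)]
    for j in range(n):
        l_inv[j][j] = 1.0
        for i in range(j + 1, n):
            l_inv[i][j] = -sum(lu[i][k] * l_inv[k][j] for k in range(j, i))

    # Form the product U^{-1} L^{-1}.
    prod = [[sum(u_inv[i][k] * l_inv[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

    # Phase 4: column interchanges, i.e. right-multiplication by P.
    inv = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            inv[i][perm[k]] = prod[i][k]
    return inv


# Illustrative matrix whose first pivot is zero, so a row interchange is needed.
a = [[0.0, 2.0, 1.0], [1.0, 1.0, 0.0], [2.0, 0.0, 3.0]]
a_inv = lu_inverse(a)
```

A quick check is to multiply `a` by `a_inv` and compare the result with the identity matrix up to rounding error.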

  8. Queuing Functions with QUARK

         QUARK_Insert_Task( panel_LU_task,
                            M, matrix_1, INPUT,
                            N, matrix_2, INOUT,
                            1, result,   OUTPUT,
                            K, buffer,   SCRATCH,
                            0 );
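The access flags in the call above (INPUT, INOUT, OUTPUT, SCRATCH) are what allow a runtime to derive task dependencies from the order of insertion. The toy class below sketches that idea only; `ToyRuntime`, `insert_task`, and the tile names are hypothetical and do not reproduce the QUARK API or its internals.

```python
INPUT, OUTPUT, INOUT, SCRATCH = "INPUT", "OUTPUT", "INOUT", "SCRATCH"


class ToyRuntime:
    """Records tasks in insertion order and infers DAG edges from data accesses."""

    def __init__(self):
        self.tasks = []          # task names, indexed by task id
        self.edges = set()       # (earlier_task_id, later_task_id) pairs
        self._last_writer = {}   # data name -> id of the last task that wrote it
        self._readers = {}       # data name -> ids of readers since that write

    def insert_task(self, name, *accesses):
        """Insert a task; `accesses` are (data_name, mode) pairs."""
        tid = len(self.tasks)
        self.tasks.append(name)
        for data, mode in accesses:
            if mode == SCRATCH:
                continue  # scratch space is private, so it creates no dependency
            writer = self._last_writer.get(data)
            if writer is not None:
                # Any access after a write must wait for that write.
                self.edges.add((writer, tid))
            if mode in (OUTPUT, INOUT):
                # A write must also wait for earlier readers of the old value.
                for reader in self._readers.get(data, []):
                    if reader != tid:
                        self.edges.add((reader, tid))
                self._last_writer[data] = tid
                self._readers[data] = []
            else:
                self._readers.setdefault(data, []).append(tid)
        return tid


rt = ToyRuntime()
rt.insert_task("panel_LU", ("A00", INOUT), ("work", SCRATCH))
rt.insert_task("trsm", ("A00", INPUT), ("A01", INOUT))
rt.insert_task("trsm", ("A00", INPUT), ("A02", INOUT))
rt.insert_task("gemm", ("A01", INPUT), ("A10", INPUT), ("A11", INOUT))
edges = sorted(rt.edges)
```

In this example the two `trsm` tasks both depend on the panel task but not on each other, so a scheduler would be free to run them concurrently.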

  9. DAGs of Tasks, Each Stage Separately: (1) LU Factorization; (2) Computation of $L^{-1}$; (3) Computation of $U^{-1}$; (4) Column swapping

  10. DAGs of Tasks, All Stages Overlapped

  11. Execution Traces: No Overlap of Stages vs. Overlap of Stages

  12. The Case for Nested Parallelism

  13. Panel Factorization as the Sequential Bottleneck (figure labels: xGETRF-REC, Swap + xTRSM, xGEMM)

  14. Panel Factorization is On Critical Path of DAG
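What it means for a task to sit on the critical path can be sketched by computing the longest cost-weighted chain through a small DAG. The task names, costs, and dependencies below are hypothetical values chosen for illustration; they are not measurements from the presentation.

```python
def critical_path_length(cost, deps):
    """Length of the longest cost-weighted path through a task DAG.

    `cost` maps task -> execution time and must list tasks in a valid
    topological order; `deps` maps task -> list of prerequisite tasks.
    """
    finish = {}
    for task in cost:
        # A task can start once all of its prerequisites have finished.
        start = max((finish[d] for d in deps.get(task, [])), default=0)
        finish[task] = start + cost[task]
    return max(finish.values())


# Hypothetical costs: panel factorizations are expensive and run one at a time,
# while updates are cheap.
cost = {"panel0": 4, "update0a": 1, "update0b": 1, "panel1": 4, "update1": 1}
deps = {
    "update0a": ["panel0"],
    "update0b": ["panel0"],
    "panel1": ["update0a"],
    "update1": ["panel1", "update0b"],
}
length = critical_path_length(cost, deps)
total_work = sum(cost.values())
```

When `length` is close to `total_work`, as in this toy DAG, adding cores cannot shorten the run much; shortening the panel tasks themselves can, which is the motivation for parallelizing the panel factorization.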

  15. Parallel Panel Factorization: Data Partitioning

  16. Parallel Panel Factorization: Algorithm

  17. Quick Performance Experiment

  18. Results

  19. Performance on AMD Magny-Cours, 4 × 12 = 48 cores

  20. LU Inversion's Power Profile: LAPACK

  21. LU Inversion's Power Profile: MKL

  22. LU Inversion's Power Profile: PLASMA

  23. PLASMA, LAPACK, MKL. This work was sponsored by NSF, DOE, and Microsoft.
