High Performance Matrix Inversion Based on LU Factorization for Multicore Architectures



slide-1
SLIDE 1

Jack Dongarra, Mathieu Faverge, Hatem Ltaief, Piotr Luszczek

High Performance Matrix Inversion Based on LU Factorization for Multicore Architectures

presented by Piotr Luszczek

slide-2
SLIDE 2

Preliminaries

slide-3
SLIDE 3

Problem Statement

$A \in \mathbb{R}^{n \times n}$

$PA = LU$

$L \to L^{-1}$

$U \to U^{-1}$

$A^{-1} \in \mathbb{R}^{n \times n}$
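
The pieces of the problem statement combine into the inverse as follows. This is a short derivation, assuming $P$ is the row permutation matrix produced by partial pivoting (so $P^{-1} = P^{T}$), an assumption the slide does not spell out:

```latex
\begin{aligned}
PA &= LU \\
A &= P^{T} L U \\
A^{-1} &= (P^{T} L U)^{-1} = U^{-1} L^{-1} P
\end{aligned}
```

Right-multiplying by $P$ permutes the columns of $U^{-1} L^{-1}$, so the row interchanges of the factorization reappear as column interchanges of the inverse.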

slide-4
SLIDE 4

To Keep in Mind...

"In the vast majority of practical computational problems, it is unnecessary and inadvisable to actually compute $A^{-1}$."

Forsythe, Malcolm, and Moler

slide-5
SLIDE 5

Data Layouts for Matrix Elements

  • Column-major (LAPACK and derivatives)
  • Tile (PLASMA)
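
To make the difference between the two layouts concrete, here is a minimal Python sketch that repacks a column-major array into square tiles. The function name, the `nb` tile-size parameter, and the dictionary keyed by tile coordinates are illustrative assumptions, not PLASMA's actual data structures; the sketch also assumes the tile size divides both matrix dimensions.

```python
def column_major_to_tiles(a, m, n, nb):
    """Repack an m-by-n column-major array into nb-by-nb tiles.

    a  : flat list, element (i, j) stored at a[i + j * m]
    nb : tile size (hypothetical name); assumed to divide m and n
    Returns a dict mapping (tile_row, tile_col) to a flat, column-major tile.
    """
    if m % nb or n % nb:
        raise ValueError("this sketch assumes nb divides both m and n")
    tiles = {}
    for tj in range(n // nb):
        for ti in range(m // nb):
            tile = []
            # Copy nb contiguous entries from each of the tile's nb columns.
            for j in range(tj * nb, (tj + 1) * nb):
                start = ti * nb + j * m
                tile.extend(a[start:start + nb])
            tiles[(ti, tj)] = tile
    return tiles


# 4x4 matrix with entry (i, j) = 10*i + j, stored column by column.
m = n = 4
a = [10 * i + j for j in range(n) for i in range(m)]
tiles = column_major_to_tiles(a, m, n, nb=2)
```

In the column-major array, the entries of one tile are spread over `nb` separate columns that lie `m` elements apart; after repacking, each tile occupies one contiguous block, which is what lets a task operate on a tile as a single unit of data.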

slide-6
SLIDE 6

Tasks and DAGs

slide-7
SLIDE 7

Block LU Inversion

  • For each panel: LU factorization
      DGETF2( ), DLASWP( ), DLASWP( ), DTRSM( ), DGEMM( )
  • For each panel: Invert U
      DTRMM( ), DTRSM( ), DTRTI2( )
  • For each panel: Invert L
      DLACPY( ), DLASET( ), DGEMM( ), DTRSM( )
  • DLASWP( ): column interchanges

Tile LU Inversion

  • For each diagonal tile
      • DGETRFR( ): parallel recursive LU
      • for each tail tile panel: DLASWP( )
      • for each tail tile: DGEMM( )
      • for each left tile panel: DLASWP( )
  • For each diagonal tile: Invert U
      • for each tile in panel: DTRSM( )
      • for each tail tile: DGEMM( )
      • for each left panel tile: DTRSM( )
      • DTRTRI( )
  • For each left tile: Invert L
      • DLACPY( )
      • DLASET( )
      • ...
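
The four stages listed above (LU factorization, inversion of U, the L stage, column interchanges) can be sketched in plain Python as an unblocked, sequential routine. This is only an illustration of the stage order, not the block or tile algorithm from the slide: the function names are hypothetical, no BLAS or LAPACK kernels are called, and the L stage is assumed to solve $X L = U^{-1}$ by substitution rather than forming $L^{-1}$ explicitly.

```python
def lu_inverse(a):
    """Invert a square matrix (list of rows) via PA = LU; returns a new matrix."""
    n = len(a)
    lu = [row[:] for row in a]
    piv = []
    # Stage 1: LU factorization with partial pivoting (L and U share storage).
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        piv.append(p)
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    # Stage 2: invert the upper triangular factor U into x.
    x = [[0.0] * n for _ in range(n)]
    for j in range(n):
        x[j][j] = 1.0 / lu[j][j]
        for i in range(j - 1, -1, -1):
            s = sum(lu[i][k] * x[k][j] for k in range(i + 1, j + 1))
            x[i][j] = -s / lu[i][i]
    # Stage 3: solve X * L = inv(U) in place (L is unit lower triangular).
    for j in range(n - 2, -1, -1):
        for i in range(n):
            x[i][j] -= sum(x[i][k] * lu[k][j] for k in range(j + 1, n))
    # Stage 4: undo the row pivoting as column interchanges, last pivot first.
    for k in range(n - 1, -1, -1):
        p = piv[k]
        if p != k:
            for row in x:
                row[k], row[p] = row[p], row[k]
    return x


def matmul(a, b):
    """Plain triple-loop product, used only to check the result."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


a = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
inv = lu_inverse(a)
prod = matmul(a, inv)  # should be close to the identity matrix
```

The block and tile variants reorganize these same four stages into panel-level or tile-level kernels so that most of the arithmetic lands in matrix-matrix operations.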

slide-8
SLIDE 8

Queuing Functions with QUARK

    QUARK_Insert_Task(
        panel_LU_task,
        M, matrix_1, INPUT,
        N, matrix_2, INOUT,
        1, result,   OUTPUT,
        K, buffer,   SCRATCH,
        0);
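
The access modes in the call above (INPUT, INOUT, OUTPUT, SCRATCH) are what allow a runtime to derive task dependencies automatically. The following Python sketch shows one way such tracking could work. It is a toy model, not the QUARK API: the `TaskGraph` class, its method, and the tile names are hypothetical, and treating SCRATCH as dependency-free is an assumption.

```python
INPUT, OUTPUT, INOUT, SCRATCH = "INPUT", "OUTPUT", "INOUT", "SCRATCH"


class TaskGraph:
    """Toy dependency tracker driven by data access modes (not QUARK itself)."""

    def __init__(self):
        self.tasks = []         # task names in insertion order
        self.edges = set()      # (earlier_task, later_task) index pairs
        self._last_writer = {}  # data name -> index of the last writing task
        self._readers = {}      # data name -> readers since the last write

    def insert_task(self, name, *accesses):
        """Register a task; accesses are (data_name, mode) pairs."""
        t = len(self.tasks)
        self.tasks.append(name)
        for data, mode in accesses:
            if mode == SCRATCH:
                continue  # assumed: temporary workspace creates no dependency
            writer = self._last_writer.get(data)
            if writer is not None and writer != t:
                # Read-after-write or write-after-write on the last writer.
                self.edges.add((writer, t))
            if mode in (OUTPUT, INOUT):
                for r in self._readers.get(data, []):
                    if r != t:
                        self.edges.add((r, t))  # write-after-read
                self._last_writer[data] = t
                self._readers[data] = []
            else:
                self._readers.setdefault(data, []).append(t)
        return t


# One step of a 2x2 tile factorization, with hypothetical tile names.
g = TaskGraph()
g.insert_task("panel_LU", ("A00", INOUT))
g.insert_task("trsm", ("A00", INPUT), ("A01", INOUT))
g.insert_task("trsm", ("A00", INPUT), ("A10", INOUT))
g.insert_task("gemm", ("A10", INPUT), ("A01", INPUT), ("A11", INOUT))
g.insert_task("panel_LU", ("A11", INOUT))
```

In this example the two `trsm` tasks both read `A00` but write different tiles, so no edge connects them and they could run concurrently; the `gemm` task waits for both.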

slide-9
SLIDE 9

DAGs of Tasks, Each Stage Separately

  1 – LU Factorization
  2 – Computation of $L^{-1}$
  3 – Computation of $U^{-1}$
  4 – Column swapping

slide-10
SLIDE 10

DAGs of Tasks, All Stages Overlapped

slide-11
SLIDE 11

Execution Traces

No Overlap of Stages

Overlap of Stages

slide-12
SLIDE 12

The Case for Nested Parallelism

slide-13
SLIDE 13

Panel Factorization as the Sequential Bottleneck

[Figure: tasks labeled xGETRF-REC, Swap + xTRSM, and xGEMM]

slide-14
SLIDE 14

Panel Factorization is On Critical Path of DAG

slide-15
SLIDE 15

Parallel Panel Factorization: Data Partitioning

slide-16
SLIDE 16

Parallel Panel Factorization: Algorithm

slide-17
SLIDE 17

Quick Performance Experiment

slide-18
SLIDE 18

Results

slide-19
SLIDE 19

Performance on AMD Magny-Cours, 4 × 12 = 48 cores

slide-20
SLIDE 20

LU Inversion's Power Profile: LAPACK

slide-21
SLIDE 21

LU Inversion's Power Profile: MKL

slide-22
SLIDE 22

LU Inversion's Power Profile: PLASMA

slide-23
SLIDE 23

LAPACK, MKL, PLASMA

This work was sponsored by NSF, DOE, and Microsoft.