A Case for Malleable Thread-Level Linear Algebra Libraries: The LU - - PowerPoint PPT Presentation



SLIDE 1

A Case for Malleable Thread-Level Linear Algebra Libraries: The LU Factorization with Partial Pivoting

Sandra Catalán, Jose R. Herrero, Enrique S. Quintana-Ortí, Rafael Rodríguez-Sánchez, Robert van de Geijn

BLIS Retreat, 19-20th September 2016, Austin (Texas)

SLIDE 2

Motivation

  • BLAS → TLP
  • LAPACK → TP (runtime)
  • Nested TLP + TP
  • Increase number of threads

SLIDE 3

Why malleability

[Timeline diagram: task Ta runs with 3 threads while task Tb runs with 5 threads]

SLIDE 4

Why malleability

[Timeline diagram: once task Ta (3 threads) finishes, its threads join task Tb (5 threads), so Tb continues with 8 threads]

Modify the DLA library so that the number of working threads can expand mid-operation.

SLIDE 5

LU as an example

The block size b is important:

  • Too small → low GEMM performance
  • Too large → too many panel factorization flops
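The two regimes come from the structure of blocked LU itself: each step factors a narrow n×b panel (hard to parallelize) and then applies a large GEMM-rich update. A minimal pure-Python sketch of right-looking blocked LU with partial pivoting (the function name `blocked_lu` and the list-of-lists matrix format are our own; a real implementation would call BLAS kernels for the triangular solve and the trailing update):

```python
def blocked_lu(A, b):
    """Factor A in place so that P A = L U; returns the pivot permutation.

    Illustrative only: the panel is factored with an unblocked loop, and
    the triangular solve / trailing update are plain triple loops that a
    real code would hand to the BLAS.
    """
    n = len(A)
    piv = list(range(n))
    for k in range(0, n, b):
        kb = min(b, n - k)
        # --- panel factorization: unblocked LU on columns k .. k+kb-1 ---
        for j in range(k, k + kb):
            p = max(range(j, n), key=lambda i: abs(A[i][j]))  # partial pivot
            A[j], A[p] = A[p], A[j]                           # swap full rows
            piv[j], piv[p] = piv[p], piv[j]
            for i in range(j + 1, n):
                A[i][j] /= A[j][j]
                for jj in range(j + 1, k + kb):               # update inside panel
                    A[i][jj] -= A[i][j] * A[j][jj]
        # --- triangular solve: U12 = inv(L11) * A12 ---
        for j in range(k + kb, n):
            for jj in range(k, k + kb):
                for i in range(jj + 1, k + kb):
                    A[i][j] -= A[i][jj] * A[jj][j]
        # --- trailing update: A22 -= L21 * U12 (the GEMM-rich part) ---
        for i in range(k + kb, n):
            for l in range(k, k + kb):
                for j in range(k + kb, n):
                    A[i][j] -= A[i][l] * A[l][j]
    return piv
```

A small b shrinks the GEMM in the trailing update; a large b grows the serial-ish panel loop, which is exactly the trade-off the slide states.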

SLIDE 6

Optimal block size

SLIDE 7

Optimal block size

SLIDE 8

The panel factorization relevance

Less than 2% of the flops, but 17.5% of the execution time.

SLIDE 9

Dealing with the panel factorization

Look-ahead: overlap the factorization of the “next” panel with the update of the “current” trailing submatrix.
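The overlap can be sketched with a dedicated panel team (a toy Python sketch; the event-logging stubs and the explicit split of the update into "next panel" and "rest" are our own illustration of the schedule, not real kernels):

```python
# Minimal sketch of the look-ahead schedule: a separate "panel team"
# factors the next panel while the main team updates the rest of the
# trailing submatrix. The kernels are stubs that only log events.
import threading
from concurrent.futures import ThreadPoolExecutor

def lu_lookahead(num_panels):
    log, lock = [], threading.Lock()

    def record(kind, k):
        with lock:
            log.append((kind, k))

    record("panel", 0)                       # first panel: nothing to overlap yet
    with ThreadPoolExecutor(max_workers=1) as panel_team:
        for k in range(num_panels - 1):
            record("update_next", k)         # bring panel k+1 up to date first
            nxt = panel_team.submit(record, "panel", k + 1)  # look ahead
            record("update_rest", k)         # big GEMM, overlapped with panel k+1
            nxt.result()                     # sync before iteration k+1
    return log
```

The factorization of panel k+1 thus runs concurrently with the bulk of step k's update, hiding the low-parallelism panel work.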

SLIDE 10

Look Ahead LU

SLIDE 11

Our setup

  • Intel Xeon E5-2603 v3
  • 6 cores at 1.6 GHz
  • BLIS 0.1.8
  • BLIS Loop 4 (jr) parallelized
  • Extrae 3.3.0
  • Panel factorization via a blocked algorithm
  • Two block sizes: bo (outer) and bi (inner)
  • The inner LU involves fine-grained computations and little parallelism

SLIDE 12

Look Ahead LU Performance

SLIDE 13

Look Ahead LU Performance

SLIDE 14

Towards malleability

  • P threads in the panel factorization
  • R threads in the update
  • Panel factorization less expensive than update

    – The P threads will eventually join the R team
    – BLAS does not allow modifying the number of working threads

SLIDE 15

Static re-partitioning

  • Workaround: split the update into several GEMM calls
  • Drawbacks:

    – Lower GEMM throughput (extra packing and suboptimal block shapes)
    – Must decide which loop to parallelize and the granularity of the partitioning
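The workaround can be sketched as follows (pure Python, illustrative; the names `gemm_acc` and `split_update` are our own, and splitting by column blocks is just one possible partitioning — in the real code each fragment is a separate gemm call that pays the packing overhead again):

```python
# Sketch of static re-partitioning: the single trailing update C -= A @ B
# is split into several independent GEMM fragments over column blocks of
# B and C, so idle panel threads can pick up later fragments.

def gemm_acc(C, A, B, j0, j1):
    """C[:, j0:j1] -= A @ B[:, j0:j1] (one 'fragment' of the update)."""
    for i in range(len(C)):
        for j in range(j0, j1):
            s = 0.0
            for l in range(len(B)):
                s += A[i][l] * B[l][j]
            C[i][j] -= s

def split_update(C, A, B, parts):
    """Run the update as `parts` separate GEMM calls on column blocks."""
    n = len(C[0])
    bounds = [n * p // parts for p in range(parts + 1)]
    for j0, j1 in zip(bounds, bounds[1:]):
        gemm_acc(C, A, B, j0, j1)   # in the real code: one BLAS gemm each
```

The result is identical to a single update, but each fragment is a smaller, less efficient GEMM, and the caller must fix the partitioning up front.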

SLIDE 16

Malleable thread-level BLAS

  • Solves the static partitioning issues:

    – Only one GEMM call → no extra data movements
    – BLIS takes care of the partitioning and the granularity
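One way to picture why a single malleable call avoids the problem is a work queue of fine-grained chunks (a toy Python sketch, not BLIS's actual mechanism, which re-partitions its own loops internally): a thread that finishes the panel simply starts pulling chunks from the same queue.

```python
# Toy sketch of malleability: the update is one loop over fine-grained
# chunks pulled from a shared queue, so a thread that becomes free (e.g.
# after the panel factorization) can join mid-update without the caller
# splitting the operation into separate GEMM calls.
import queue
import threading
import time

def malleable_update(num_chunks):
    work = queue.SimpleQueue()
    for c in range(num_chunks):
        work.put(c)
    done, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                c = work.get_nowait()
            except queue.Empty:
                return
            with lock:
                done.append(c)          # stand-in for updating one chunk

    t_update = threading.Thread(target=worker)
    t_panel = threading.Thread(target=worker)
    t_update.start()                    # the R "update" team starts now
    time.sleep(0.001)                   # the P thread is busy with the panel...
    t_panel.start()                     # ...then simply joins the update
    t_update.join()
    t_panel.join()
    return done
```

Every chunk is processed exactly once regardless of when the second thread joins, which is the property a malleable BLAS provides inside a single call.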

SLIDE 17

How Malleability behaves

SLIDE 18

And the small case...

SLIDE 19

What if the panel factorization is more expensive than the update?

  • If the R team finishes before the P team → stop the panel factorization early

    – Right-looking (RL) LU: keep a copy of the panel
    – Left-looking (LL) LU: synchronization among the threads follows the same idea
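The early-termination idea can be sketched with a shared flag that the update team raises (a minimal Python sketch with hypothetical names; the real strategy also relies on the kept panel copy so the abandoned columns can be re-factored by the full team):

```python
# Sketch of early termination: the panel team checks a shared flag after
# each column; when the update team finishes first, it raises the flag
# and the panel factorization stops so its threads can be redeployed.
import threading

def panel_factor(columns, stop):
    """Factor panel columns one by one; bail out if `stop` is raised."""
    done = []
    for j in columns:
        if stop.is_set():
            break                # remaining columns re-factored later
        done.append(j)           # stand-in for factoring column j
    return done
```

With the flag clear the whole panel is factored; once the update team sets it, the panel loop exits at the next column boundary.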

SLIDE 20

Look-ahead via runtimes

✔ TP execution
✔ Adaptive-depth look-ahead
✗ Re-packing and data movements (many GEMM calls)
✗ The block size fixes the granularity of the tasks
✗ Rarely exploits TP+TLP

SLIDE 21

Experimental results

  • LU, LU_LA, LU_MB, LU_OS
  • Square matrices from n=500 to n=12,000
  • bo was tested for values from 32 to 512, in steps of 32
  • bi was evaluated for 16 and 32

SLIDE 22

Performance comparison

SLIDE 23

Performance comparison

SLIDE 24

Conclusions

  • Malleable implementation of a DLA library
  • Competitive results (small matrices)
  • Strategies still pending to be applied (early termination)

SLIDE 25

THANK YOU