SLIDE 1

Communication-avoiding Cholesky-QR2 for rectangular matrices

Edward Hutter and Edgar Solomonik

Laboratory for Parallel Numerical Algorithms Department of Computer Science University of Illinois at Urbana-Champaign

March 9, 2018

Edward Hutter Parallel 3D Cholesky-QR2

SLIDE 2

Abstract

Novelty:

  • new distributed-memory QR factorization algorithm
  • extends the recent Cholesky-QR2 algorithm to rectangular matrices
  • exploits a tunable processor grid to provably reduce communication
  • utilizes the first distributed-memory implementation of recursive 3D Cholesky factorization

Benefits

practical, flexible, and achieves minimal communication (data movement between processors)

Drawbacks

  • matrix must be sufficiently well-conditioned
  • machine must have sufficient memory

SLIDE 3

Motivation for reducing parallel algorithmic costs

Communication cost model:

  • α — cost of sending a message over the network
  • β — cost of injecting a byte of data into the network
  • γ — cost of a floating-point operation

Current cost trend for distributed-memory machines: α ≫ β ≫ γ

Goal: a QR factorization algorithm that prioritizes minimizing synchronization and communication cost.
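The model above can be sketched in a few lines of Python. The constants below are hypothetical order-of-magnitude values for illustration only, not measurements from any particular machine:

```python
# Hypothetical alpha-beta-gamma model constants (illustrative only):
# on distributed-memory machines, alpha >> beta >> gamma.
ALPHA = 1e-6   # seconds per message (network latency)
BETA = 1e-9    # seconds per byte injected into the network
GAMMA = 1e-12  # seconds per floating-point operation

def model_time(messages, bytes_moved, flops):
    """Estimated runtime of an algorithm phase under the alpha-beta-gamma model."""
    return messages * ALPHA + bytes_moved * BETA + flops * GAMMA

# Moving 1 MB as a single message is far cheaper than as 1024 separate
# 1 KB messages, which is why minimizing synchronization (the alpha term)
# matters most under this trend.
one_message = model_time(1, 2**20, 0)
many_messages = model_time(1024, 2**20, 0)
```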

SLIDE 4

2D and 3D QR algorithm characteristics

  • 2D algorithms are communication-optimal assuming a minimal memory footprint
  • 3D algorithms take advantage of extra memory to reduce communication; for QR, they exist in theory but have not been implemented or studied in practice¹ ²

  • 1A. Tiskin 2007, ”Communication-efficient generic pairwise elimination”
  • 2E. Solomonik et al., ”A communication-avoiding parallel algorithm for the symmetric eigenvalue problem”

SLIDE 5

Parallel QR factorization cost comparison

ScaLAPACK's PGEQRF routine is a 2D Householder QR. CA-QR utilizes TSQR along panels to reduce synchronization. Communication-avoiding CholeskyQR2 is a tunable Cholesky-based QR factorization algorithm. We will compare both the theoretical cost and the performance:

  T_2D-Householder-QR = O(n log P · α + (mn/√P) · β)
  T_CAQR-HR = O(√P log² P · α + (mn/√P) · β)
  T_CA-CholeskyQR2 = O((Pn/m)^(2/3) log P · α + (n²m/P)^(2/3) · β)

SLIDE 6

Cholesky-QR algorithm

[Q, R] ← CholeskyQR(A)

  B ← AᵀA
  R ← Cholesky(B)
  Q ← AR⁻¹

AᵀA = RᵀQᵀQR = RᵀR if Q is orthogonal.
Both Cholesky and the triangular solve are backward stable, yet sensitive to the conditioning of B.
Cholesky can break down due to numerical error, losing positive-definiteness.
CholeskyQR is not stable: the deviation from orthogonality of the computed Q is O(κ(A)² · ε), where ε is machine epsilon.
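The steps above can be written down serially with NumPy. This is a sequential sketch of the algorithm, not the distributed implementation; a general solve stands in for the triangular solve so the snippet stays dependency-free:

```python
import numpy as np

def cholesky_qr(A):
    """One CholeskyQR pass: B = A^T A, R = Cholesky(B), Q = A R^{-1}.
    np.linalg.cholesky raises LinAlgError if B loses positive-definiteness."""
    B = A.T @ A
    R = np.linalg.cholesky(B).T        # upper-triangular factor, so B = R^T R
    Q = np.linalg.solve(R.T, A.T).T    # solve R^T Q^T = A^T rather than forming R^{-1}
    return Q, R

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 50))    # well-conditioned tall-and-skinny matrix
Q, R = cholesky_qr(A)
# For well-conditioned A, Q^T Q is close to the identity;
# in general the deviation grows as O(kappa(A)^2 * eps).
```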

SLIDE 7

Cholesky-QR2 algorithm

[Q, R] ← CholeskyQR2(A)

  Q₁, R₁ ← CholeskyQR(A)
  Q, R₂ ← CholeskyQR(Q₁)
  R ← R₂R₁

By performing CholeskyQR twice, the residual and the deviation from orthogonality are O(ε) if κ(A) = O(1/√ε).³

Proposed as a replacement for TSQR for tall-and-skinny matrices:

  • lower theoretical communication cost by a factor of O(log P), better performance, and a simpler implementation⁴
  • TSQR, however, is unconditionally stable

  • 3Y. Yamamoto et al., ”Roundoff Error Analysis of the CholeskyQR2 ..”
  • 4Y. Yamamoto et al., ”CholeskyQR2: A communication-avoiding algorithm”
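A serial NumPy sketch of the two-pass scheme shows the effect on a deliberately ill-conditioned test matrix (the 1e5 condition number below is an arbitrary choice, well inside the κ(A) = O(1/√ε) regime):

```python
import numpy as np

def cholesky_qr(A):
    R = np.linalg.cholesky(A.T @ A).T
    return np.linalg.solve(R.T, A.T).T, R

def cholesky_qr2(A):
    # The second pass reorthogonalizes Q1, squaring down the orthogonality error.
    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)
    return Q, R2 @ R1

# Build a 500 x 20 matrix with condition number 1e5 via its SVD factors.
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((500, 20)))
V, _ = np.linalg.qr(rng.standard_normal((20, 20)))
A = U @ np.diag(np.logspace(0, 5, 20)) @ V.T

Q1, _ = cholesky_qr(A)
Q2, R = cholesky_qr2(A)
err1 = np.linalg.norm(Q1.T @ Q1 - np.eye(20))   # O(kappa(A)^2 * eps)
err2 = np.linalg.norm(Q2.T @ Q2 - np.eye(20))   # O(eps)
```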

SLIDE 8

1D CholeskyQR2

SLIDE 9

1D CholeskyQR2

T_1D-CholeskyQR2(m, n, P) = O(log P · α + n² · β + (n²m/P + n³) · γ)

SLIDE 10

Figure: 3D algorithm for square matrix multiplication¹ ² ³

  • ¹Berntsen 1989, "Communication-efficient matrix multiplication on hypercubes"
  • ²Aggarwal, Chandra, Snir 1990, "Communication complexity of PRAMs"
  • ³Agarwal et al. 1995, "A three-dimensional approach to parallel matrix multiplication"

SLIDE 11

Figure: 3D algorithm for square matrix multiplication¹ ² ³

T_3D-MM(n, P) = O(log P · α + (n²/P^(2/3)) · β + (n³/P) · γ)

  • ¹Berntsen 1989, "Communication-efficient matrix multiplication on hypercubes"
  • ²Aggarwal, Chandra, Snir 1990, "Communication complexity of PRAMs"
  • ³Agarwal et al. 1995, "A three-dimensional approach to parallel matrix multiplication"

SLIDE 12

3D recursive CholeskyInverse

[L, L⁻¹] ← CholeskyInverse(A)

  [L₁₁, L₁₁⁻¹] ← CholeskyInverse(A₁₁)
  L₂₁ ← A₂₁L₁₁⁻ᵀ
  [L₂₂, L₂₂⁻¹] ← CholeskyInverse(A₂₂ − L₂₁L₂₁ᵀ)
  L₂₁⁻¹ ← −L₂₂⁻¹L₂₁L₁₁⁻¹

T_3D-CholeskyInverse(n, P) = 2 · T_3D-CholeskyInverse(n/2, P) + O(3D matrix multiplication)

SLIDE 13

3D recursive CholeskyInverse

[L, L⁻¹] ← CholeskyInverse(A)

  [L₁₁, L₁₁⁻¹] ← CholeskyInverse(A₁₁)
  L₂₁ ← A₂₁L₁₁⁻ᵀ
  [L₂₂, L₂₂⁻¹] ← CholeskyInverse(A₂₂ − L₂₁L₂₁ᵀ)
  L₂₁⁻¹ ← −L₂₂⁻¹L₂₁L₁₁⁻¹

T_3D-CholeskyInverse(n, P) = 2 · T_3D-CholeskyInverse(n/2, P) + O(3D matrix multiplication)
T_3D-CholeskyInverse(n, P) = O(P^(2/3) log P · α + (n²/P^(2/3)) · β + (n³/P) · γ)
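The recursion can be written down serially in NumPy. This is a sequential sketch; in the actual algorithm each recursive call and each matrix product runs on a 3D processor grid:

```python
import numpy as np

def cholesky_inverse(A):
    """Recursive blocked Cholesky returning (L, L^{-1}) for symmetric
    positive definite A, following the recursion on the slide."""
    n = A.shape[0]
    if n == 1:
        ell = np.sqrt(A[0, 0])
        return np.array([[ell]]), np.array([[1.0 / ell]])
    k = n // 2
    L11, L11inv = cholesky_inverse(A[:k, :k])
    L21 = A[k:, :k] @ L11inv.T                             # L21 <- A21 L11^{-T}
    L22, L22inv = cholesky_inverse(A[k:, k:] - L21 @ L21.T)
    L21inv = -L22inv @ L21 @ L11inv                        # lower-left block of L^{-1}
    zeros = np.zeros((k, n - k))
    L = np.block([[L11, zeros], [L21, L22]])
    Linv = np.block([[L11inv, zeros], [L21inv, L22inv]])
    return L, Linv

rng = np.random.default_rng(2)
M = rng.standard_normal((8, 8))
A = M @ M.T + 8.0 * np.eye(8)    # symmetric positive definite test matrix
L, Linv = cholesky_inverse(A)
```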

SLIDE 14

Triangular solve with multiple right hand sides (TRSM)

  • TRSM3D is an option to reduce the computation cost by 2× at the highest levels of recursion
  • Skip two matrix multiplications by not obtaining L₂₁⁻¹
  • Utilize diagonally inverted blocks and iterate along column panels
  • Two 3D matrix multiplications per iteration to solve for the panel and update the trailing matrix

SLIDE 15

Figure: AᵀA over a tunable c × d × c processor grid

SLIDE 16

Figure: Start with a tunable c × d × c processor grid

SLIDE 17

Figure: Broadcast columns of A

Cost: 2 log₂ c · α + (2mn/dc) · β

SLIDE 18

Figure: Reduce contiguous groups of size c

Cost: 2 log₂ c · α + (2n²/c²) · β + (n²/c²) · γ

SLIDE 19

Figure: Allreduce alternating groups of size d/c

Cost: 2 log₂(d/c) · α + (2n²/c²) · β + (n²/c²) · γ

SLIDE 20

Figure: Broadcast missing pieces of B along depth

Cost: 2 log₂ c · α + (2n²/c²) · β

SLIDE 21

Figure: d/c simultaneous 3D CholeskyInverse on cubes of dimension c

Cost: O(c² log c³ · α + (n²/c²) · β + (n³/c³) · γ)

SLIDE 22

Figure: d/c simultaneous 3D matrix multiplications or TRSMs on cubes of dimension c

Cost: O(2 log₂ c³ · α + (4mn/dc + (n² + nc)/c²) · β + (n²m/c²d) · γ)

SLIDE 23

Figure: Tunable Cholesky-QR2 Algorithm

SLIDE 24

Cost comparison

By setting m/d = n/c and enforcing P = c²d, we are able to solve for the optimal grid and minimize communication.

Metric           | 1D CholeskyQR2 | 2D Householder QR | 2D CAQR-HR  | Optimal tunable
# of messages    | O(log P)       | O(n log P)        | O(√P log P) | O((Pn/m)^(2/3) log P)
# of words       | O(n²)          | O(mn/√P)          | O(mn/√P)    | O((n²m/P)^(2/3))
# of flops       | O(n²m/P + n³)  | O(mn²/P)          | O(mn²/P)    | O(n²m/P)
Memory footprint | O(mn/P + n²)   | O(mn/P)           | O(mn/P)     | O((n²m/P)^(2/3))

SLIDE 25

Implementation and performance testing

All code was written from scratch in modern C++ and MPI. This is a first implementation:

  • TRSM not used in the results
  • no overlap of computation and communication
  • no topology-aware mapping
  • no threading except for multi-threaded BLAS

For each processor count and matrix size:

  • ScaLAPACK PGEQRF was tuned over block sizes and processor grid dimensions
  • CA-CholeskyQR2 was tuned over a range of 3D grid dimensions

The best performance numbers were chosen for comparison.

SLIDE 26

Architecture details

Cray XC40 system at Argonne Leadership Computing Facility

Compute nodes: Intel Knights Landing

  • each node is a single Xeon Phi chip with 64 cores, 16 GB MCDRAM, 192 GB DDR4
  • each core supports 4 hardware threads, up to 256 threads per node

Interconnect topology: Cray Aries dual-plane Dragonfly with 10 groups

SLIDE 27

Architecture details

Cray XC40 system at Argonne Leadership Computing Facility

Compute nodes: Intel Knights Landing

  • each node is a single Xeon Phi chip with 64 cores, 16 GB MCDRAM, 192 GB DDR4
  • each core supports 4 hardware threads, up to 256 threads per node

Interconnect topology: Cray Aries dual-plane Dragonfly with 10 groups

Our configuration:

  • 16 MPI processes per node, 4 OpenMP threads per process, 2 hyperthreads per core
  • MKL 2018 libraries used for BLAS and LAPACK
  • MKL 2018 ScaLAPACK library used as the comparison for benchmarking
  • (d, c) pairs for the c × d × c processor grid are given for each data point

SLIDE 28

Figure: Weak scaling on Theta (XC40) — Gigaflops/s vs. #nodes (16 processes/node, 128 to 1024 nodes); Cholesky-QR2 (m = #nodes·256, n = 128) vs. ScaLAPACK QR; (d, c) grid dimensions per data point: (512, 2), (1024, 2), (2048, 2), (4096, 2)

SLIDE 29

Figure: Weak scaling on Theta (XC40) — Gigaflops/s vs. #nodes (16 processes/node, 128 to 1024 nodes); Cholesky-QR2 (m = #nodes·128, n = 512) vs. ScaLAPACK QR; (d, c) grid dimensions per data point: (128, 4), (256, 4), (512, 4), (1024, 4)

SLIDE 30

Figure: Weak scaling on Theta (XC40) — Gigaflops/s vs. #nodes (16 processes/node, 128 to 1024 nodes); Cholesky-QR2 (m = #nodes·512, n = 1024) vs. ScaLAPACK QR; (d, c) grid dimensions per data point: (32, 8), (64, 8), (128, 8), (256, 8)

SLIDE 31

Figure: Strong scaling on Theta (XC40) — Gigaflops/s vs. #nodes (16 processes/node, 128 to 1024 nodes); Cholesky-QR2 (m = 16384, n = 256) vs. ScaLAPACK QR; (d, c) grid dimensions per data point: (128, 4), (256, 4), (512, 4), (1024, 4)

SLIDE 32

Performance analysis

CholeskyQR2 outperforms ScaLAPACK on rectangular matrices.
ScaLAPACK outperforms CholeskyQR2 on tall-and-skinny and near-square matrices.
Neither shows strong scaling to 1024 nodes.

Flop cost is a concern:

  • Householder QR performs 2mn² − 2n³/3 flops
  • CholeskyQR2 performs 4mn² + 5n³/3 flops

Communication improvement of O(P^(1/6)).
Perhaps P is not large enough to make a difference?
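The roughly 2× flop gap for tall-and-skinny matrices follows directly from the leading terms above; a quick numerical check (the sizes below are arbitrary illustrative values):

```python
def householder_flops(m, n):
    # Leading-order flop count for Householder QR: 2mn^2 - 2n^3/3
    return 2 * m * n**2 - 2 * n**3 / 3

def choleskyqr2_flops(m, n):
    # Leading-order flop count for CholeskyQR2: 4mn^2 + 5n^3/3
    return 4 * m * n**2 + 5 * n**3 / 3

# For a tall-and-skinny matrix (m >> n) the n^3 terms are negligible,
# so the ratio approaches 4mn^2 / 2mn^2 = 2.
m, n = 1_000_000, 128
ratio = choleskyqr2_flops(m, n) / householder_flops(m, n)
```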

SLIDE 33

Future work

Numerical library integration is our focus. Much work remains:

  • optimize and tune the code; run more large-scale numerical tests to better understand performance
  • evaluate more tunable grid shapes to further analyze patterns in performance
  • develop memory-tunable variants for machines without enough memory
  • reduce the flop cost and improve the stability

MS87 talk at 3:45 on fixing the stability of CholeskyQR2.

Acknowledgement: Argonne Leadership Computing Facility

SLIDE 34

Optimum cost of CholeskyQR2-Tunable

The advantage of using a tunable grid lies in the ability to frame the shape of the grid around the shape of the rectangular m × n matrix A. Optimal communication can be attained by ensuring that the grid perfectly fits the dimensions of A, or that the dimensions of the grid are proportional to the dimensions of the matrix. We derive the cost for the optimal ratio m/d = n/c below.

Using the equation P = c²d together with m/d = n/c, we solve for d and c in terms of m, n, and P. Solving the system of equations yields

  c = (Pn/m)^(1/3),   d = (Pm²/n²)^(1/3).

We can plug these values into the cost of CholeskyQR2-Tunable to find the optimal cost:

  T_CholeskyQR2-Tunable(m, n, (Pn/m)^(1/3), (Pm²/n²)^(1/3))
    = O((Pn/m)^(2/3) log P · α + (n²m/P)^(2/3) · β + (n²m/P) · γ)    (1)

Grid shape: optimal

  # of messages:    O((Pn/m)^(2/3) log P)
  # of words:       O((n²m/P)^(2/3))
  # of flops:       O(n²m/P)
  Memory footprint: O((n²m/P)^(2/3))
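The derived grid dimensions are easy to sanity-check numerically against the two constraints; the P, m, n below are arbitrary powers of two chosen so the cube roots come out near-exact:

```python
# Check that c = (Pn/m)^(1/3) and d = (Pm^2/n^2)^(1/3) satisfy
# both P = c^2 d and the aspect-ratio condition m/d = n/c.
P, m, n = 8192, 2**20, 2**10

c = (P * n / m) ** (1.0 / 3.0)
d = (P * m**2 / n**2) ** (1.0 / 3.0)
# With these values, Pn/m = 8 and Pm^2/n^2 = 2^33, so c = 2 and d = 2048.
```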