Computing the Rank profile matrix (and some bonus) Joun ees - - PowerPoint PPT Presentation

computing the rank profile matrix and some bonus
SMART_READER_LITE
LIVE PREVIEW

Computing the Rank profile matrix (and some bonus) Joun ees - - PowerPoint PPT Presentation

Computing the Rank profile matrix (and some bonus) Joun ees Nationales du Calcul Formel Cl ement Pernet Laboratoire de lInformatique du Parall elisme, Univ. Grenoble Alpes, Univ. de Lyon, CNRS, Inria Cluny. 2 novembre 2015 C.


slide-1
SLIDE 1

Computing the Rank profile matrix (and some bonus)

Joun´ ees Nationales du Calcul Formel Cl´ ement Pernet

Laboratoire de l’Informatique du Parall´ elisme,

  • Univ. Grenoble Alpes, Univ. de Lyon, CNRS, Inria

Cluny. 2 novembre 2015

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 1 / 51

slide-2
SLIDE 2

Trailer

Which CPU arithmetic to multiply 2000 × 2000 matrices over 200bit integers ?

1 boolean 2 int32 t 3 int64 t 4 float 5 double 6 GMP mpz t (hence uint64 t)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 2 / 51

slide-3
SLIDE 3

Trailer

Which CPU arithmetic to multiply 2000 × 2000 matrices over 200bit integers ?

1 boolean 2 int32 t 3 int64 t 4 float 5 double 6 GMP mpz t (hence uint64 t)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 2 / 51

slide-4
SLIDE 4

Trailer

Which CPU arithmetic to multiply 2000 × 2000 matrices over 200bit integers ?

1 boolean 2 int32 t 3 int64 t 4 float 5 double 6 GMP mpz t (hence uint64 t)

Rank profiles: how to select the first 3 linearly indep rows of

1 1 1 1 2 1 1 2 2 2 0 0 0 0 1 2 1 1

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 2 / 51

slide-5
SLIDE 5

Trailer

Which CPU arithmetic to multiply 2000 × 2000 matrices over 200bit integers ?

1 boolean 2 int32 t 3 int64 t 4 float 5 double 6 GMP mpz t (hence uint64 t)

Rank profiles: how to select the first 3 linearly indep rows of

1 1 1 1 2 1 1 2 2 2 0 0 0 0 1 2 1 1

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 2 / 51

slide-6
SLIDE 6

Trailer

Which CPU arithmetic to multiply 2000 × 2000 matrices over 200bit integers ?

1 boolean 2 int32 t 3 int64 t 4 float 5 double 6 GMP mpz t (hence uint64 t)

Rank profiles: how to select the first 3 linearly indep rows of

1 1 1 1 2 1 1 2 2 2 0 0 0 0 1 1 1 2

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 2 / 51

slide-7
SLIDE 7

Trailer

Which CPU arithmetic to multiply 2000 × 2000 matrices over 200bit integers ?

1 boolean 2 int32 t 3 int64 t 4 float 5 double 6 GMP mpz t (hence uint64 t)

Rank profiles: how to select the first 3 linearly indep rows of

1 1 1 1 2 1 1 2 2 2 0 0 0 1 1 1 2

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 2 / 51

slide-8
SLIDE 8

Outline

1

Choosing the underlying arithmetic Using machine word arithmetic Larger field sizes

2

Reductions and building blocks

3

Gaussian elimination Which reduction Computing rank profiles Algorithmic instances Relation to other decompositions The small rank case

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 3 / 51

slide-9
SLIDE 9

Choosing the underlying arithmetic

Outline

1

Choosing the underlying arithmetic Using machine word arithmetic Larger field sizes

2

Reductions and building blocks

3

Gaussian elimination Which reduction Computing rank profiles Algorithmic instances Relation to other decompositions The small rank case

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 4 / 51

slide-10
SLIDE 10

Choosing the underlying arithmetic

Most common operation

Most of dense linear algebra operations boil down to (a lot of) y ← y ± a * b

◮ dot-product ◮ matrix-matrix multiplication ◮ rank 1 update in Gaussian elimination ◮ Schur complements, . . .

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 5 / 51

slide-11
SLIDE 11

Choosing the underlying arithmetic

Which computer arithmetic ?

Many base fields/rings to support

Z2 1 bit Z3,5,7 2-3 bits Zp 4-26 bits Z, Q > 32 bits Zp > 32 bits

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 6 / 51

slide-12
SLIDE 12

Choosing the underlying arithmetic

Which computer arithmetic ?

Many base fields/rings to support

Z2 1 bit Z3,5,7 2-3 bits Zp 4-26 bits Z, Q > 32 bits Zp > 32 bits

Available CPU arithmetic

◮ boolean ◮ integer (fixed size) ◮ floating point ◮ .. and their vectorization

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 6 / 51

slide-13
SLIDE 13

Choosing the underlying arithmetic

Which computer arithmetic ?

Many base fields/rings to support

Z2 1 bit bit-packing Z3,5,7 2-3 bits bit-slicing, bit-packing Zp 4-26 bits CPU arithmetic Z, Q > 32 bits

  • multiprec. ints, big ints,CRT, lifting

Zp > 32 bits

  • multiprec. ints, big ints, CRT

Available CPU arithmetic

◮ boolean ◮ integer (fixed size) ◮ floating point ◮ .. and their vectorization

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 6 / 51

slide-14
SLIDE 14

Choosing the underlying arithmetic Using machine word arithmetic

Dense linear algebra over Zp for word-size p

Delayed modular reductions

1 Compute using integer arithmetic 2 Reduce modulo p only when necessary

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 7 / 51

slide-15
SLIDE 15

Choosing the underlying arithmetic Using machine word arithmetic

Dense linear algebra over Zp for word-size p

Delayed modular reductions

1 Compute using integer arithmetic 2 Reduce modulo p only when necessary

When to reduce ? Bound the values of all intermediate computations.

◮ A priori:

Representation of Zp {0 . . . p − 1} {− p−1

2

. . . p−1

2 }

Scalar product, Classic MatMul n(p − 1)2 n p−1

2

2

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 7 / 51

slide-16
SLIDE 16

Choosing the underlying arithmetic Using machine word arithmetic

Dense linear algebra over Zp for word-size p

Delayed modular reductions

1 Compute using integer arithmetic 2 Reduce modulo p only when necessary

When to reduce ? Bound the values of all intermediate computations.

◮ A priori:

Representation of Zp {0 . . . p − 1} {− p−1

2

. . . p−1

2 }

Scalar product, Classic MatMul n(p − 1)2 n p−1

2

2 Strassen-Winograd MatMul (ℓ rec. levels) ( 1+3ℓ

2

)2⌊ n

2ℓ ⌋ (p − 1)2

9ℓ⌊ n

2ℓ ⌋

p−1

2

2

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 7 / 51

slide-17
SLIDE 17

Choosing the underlying arithmetic Using machine word arithmetic

Computing over fixed size integers

How to compute with (machine word size) integers efficiently?

1 use CPU’s integer arithmetic units

y += a * b: correct if |ab + y| < 263 |a|, |b| < 231

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 8 / 51

slide-18
SLIDE 18

Choosing the underlying arithmetic Using machine word arithmetic

Computing over fixed size integers

How to compute with (machine word size) integers efficiently?

1 use CPU’s integer arithmetic units

y += a * b: correct if |ab + y| < 263 |a|, |b| < 231

movq (%rax,%rdx,8), %rax imulq

  • 56(%rbp), %rax

addq %rax, %rcx movq

  • 80(%rbp), %rax
  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 8 / 51

slide-19
SLIDE 19

Choosing the underlying arithmetic Using machine word arithmetic

Computing over fixed size integers

How to compute with (machine word size) integers efficiently?

1 use CPU’s integer arithmetic units + vectorization

y += a * b: correct if |ab + y| < 263 |a|, |b| < 231

movq (%rax,%rdx,8), %rax imulq

  • 56(%rbp), %rax

addq %rax, %rcx movq

  • 80(%rbp), %rax

vpmuludq %xmm3, %xmm0,%xmm0 vpaddq %xmm2,%xmm0,%xmm0 vpsllq $32,%xmm0,%xmm0

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 8 / 51

slide-20
SLIDE 20

Choosing the underlying arithmetic Using machine word arithmetic

Computing over fixed size integers

How to compute with (machine word size) integers efficiently?

1 use CPU’s integer arithmetic units + vectorization

y += a * b: correct if |ab + y| < 263 |a|, |b| < 231

movq (%rax,%rdx,8), %rax imulq

  • 56(%rbp), %rax

addq %rax, %rcx movq

  • 80(%rbp), %rax

vpmuludq %xmm3, %xmm0,%xmm0 vpaddq %xmm2,%xmm0,%xmm0 vpsllq $32,%xmm0,%xmm0

2 use CPU’s floating point units

y += a * b: correct if |ab + y| < 253 |a|, |b| < 226

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 8 / 51

slide-21
SLIDE 21

Choosing the underlying arithmetic Using machine word arithmetic

Computing over fixed size integers

How to compute with (machine word size) integers efficiently?

1 use CPU’s integer arithmetic units + vectorization

y += a * b: correct if |ab + y| < 263 |a|, |b| < 231

movq (%rax,%rdx,8), %rax imulq

  • 56(%rbp), %rax

addq %rax, %rcx movq

  • 80(%rbp), %rax

vpmuludq %xmm3, %xmm0,%xmm0 vpaddq %xmm2,%xmm0,%xmm0 vpsllq $32,%xmm0,%xmm0

2 use CPU’s floating point units

y += a * b: correct if |ab + y| < 253 |a|, |b| < 226

movsd (%rax,%rdx,8), %xmm0 mulsd

  • 56(%rbp), %xmm0

addsd %xmm0, %xmm1 movq %xmm1, %rax

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 8 / 51

slide-22
SLIDE 22

Choosing the underlying arithmetic Using machine word arithmetic

Computing over fixed size integers

How to compute with (machine word size) integers efficiently?

1 use CPU’s integer arithmetic units + vectorization

y += a * b: correct if |ab + y| < 263 |a|, |b| < 231

movq (%rax,%rdx,8), %rax imulq

  • 56(%rbp), %rax

addq %rax, %rcx movq

  • 80(%rbp), %rax

vpmuludq %xmm3, %xmm0,%xmm0 vpaddq %xmm2,%xmm0,%xmm0 vpsllq $32,%xmm0,%xmm0

2 use CPU’s floating point units + vectorization

y += a * b: correct if |ab + y| < 253 |a|, |b| < 226

movsd (%rax,%rdx,8), %xmm0 mulsd

  • 56(%rbp), %xmm0

addsd %xmm0, %xmm1 movq %xmm1, %rax vinsertf128 $0x1, 16(%rcx,%rax), %ymm0, vmulpd %ymm1, %ymm0, %ymm0 vaddpd (%rsi,%rax),%ymm0, %ymm0 vmovapd %ymm0, (%rsi,%rax)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 8 / 51

slide-23
SLIDE 23

Choosing the underlying arithmetic Using machine word arithmetic

Exploiting in-core parallelism

Ingredients SIMD: Single Instruction Multiple Data: 1 arith. unit acting on a vector of data

MMX 64 bits SSE 128bits AVX 256 bits AVX-512 512 bits

4 x 64 = 256 bits

+

y[3] x[3] x[2] y[2] y[1] x[1] x[0] y[0] x[0]+y[0] x[1]+y[1] x[2]+y[2] x[3]+y[3]

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 9 / 51

slide-24
SLIDE 24

Choosing the underlying arithmetic Using machine word arithmetic

Exploiting in-core parallelism

Ingredients SIMD: Single Instruction Multiple Data: 1 arith. unit acting on a vector of data

MMX 64 bits SSE 128bits AVX 256 bits AVX-512 512 bits

4 x 64 = 256 bits

+

y[3] x[3] x[2] y[2] y[1] x[1] x[0] y[0] x[0]+y[0] x[1]+y[1] x[2]+y[2] x[3]+y[3]

Pipeline: amortize the latency of an operation when used repeatedly

throughput of 1 op/ Cycle for all arithmetic ops considered here

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 9 / 51

slide-25
SLIDE 25

Choosing the underlying arithmetic Using machine word arithmetic

Exploiting in-core parallelism

Ingredients SIMD: Single Instruction Multiple Data: 1 arith. unit acting on a vector of data

MMX 64 bits SSE 128bits AVX 256 bits AVX-512 512 bits

4 x 64 = 256 bits

+

y[3] x[3] x[2] y[2] y[1] x[1] x[0] y[0] x[0]+y[0] x[1]+y[1] x[2]+y[2] x[3]+y[3]

Pipeline: amortize the latency of an operation when used repeatedly

throughput of 1 op/ Cycle for all arithmetic ops considered here

Execution Unit parallelism: multiple arith. units acting simulatneously on distinct registers

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 9 / 51

slide-26
SLIDE 26

Choosing the underlying arithmetic Using machine word arithmetic

SIMD and vectorization

Intel Sandybridge micro-architecture

FMUL

Port 0

FADD

Port 1 Port 5 4 x 64 = 256 bits Scheduler 2 x 64 = 128 bits

Int ADD Int MUL Int ADD

Performs at every clock cycle:

◮ 1 Floating Pt. Mul

× 4

◮ 1 Floating Pt. Add

× 4 Or:

◮ 1 Integer Mul

× 2

◮ 2 Integer Add

× 2

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 10 / 51

slide-27
SLIDE 27

Choosing the underlying arithmetic Using machine word arithmetic

SIMD and vectorization

Intel Haswell micro-architecture

FMA Int MUL

Port 0

FMA Int ADD

Port 1

Int ADD

Port 5 4 x 64 = 256 bits Scheduler

Performs at every clock cycle:

◮ 2 Floating Pt. Mul & Add × 4

Or:

◮ 1 Integer Mul

× 4

◮ 2 Integer Add

× 4 FMA: Fused Multiplying & Accumulate, c += a * b

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 11 / 51

slide-28
SLIDE 28

Choosing the underlying arithmetic Using machine word arithmetic

SIMD and vectorization

AMD Bulldozer micro-architecture

Port 0 Port 1 Port 3 Scheduler 2 x 64 = 128 bits

Int MUL Int ADD FMA FMA Int ADD

Port 2 2 x 64 = 128 bits

Performs at every clock cycle:

◮ 2 Floating Pt. Mul & Add × 2

Or:

◮ 1 Integer Mul

× 2

◮ 2 Integer Add

× 2 FMA: Fused Multiplying & Accumulate, c += a * b

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 12 / 51

slide-29
SLIDE 29

Choosing the underlying arithmetic Using machine word arithmetic

SIMD and vectorization

Intel Nehalem micro-architecture

Port 0 Port 1 Port 5 Scheduler 2 x 64 = 128 bits

Int MUL Int ADD FADD FMUL

2 x 64 = 128 bits

Int ADD

Performs at every clock cycle:

◮ 1 Floating Pt. Mul

× 2

◮ 1 Floating Pt. Add

× 2 Or:

◮ 1 Integer Mul

× 2

◮ 2 Integer Add

× 2

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 13 / 51

slide-30
SLIDE 30

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Register size # Adders # Multipliers # FMA # daxpy /Cycle CPU Freq. (Ghz) Speed of Light (Gfops) Speed in practice (Gfops) Intel Haswell INT 256 2 1 4 3.5 28 AVX2 FP 256 2 8 3.5 56 Intel Sandybridge INT AVX1 FP AMD Bulldozer INT FMA4 FP Intel Nehalem INT SSE4 FP AMD K10 INT SSE4a FP Speed of light: CPU freq × ( # daxpy / Cycle) ×2

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 14 / 51

slide-31
SLIDE 31

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Register size # Adders # Multipliers # FMA # daxpy /Cycle CPU Freq. (Ghz) Speed of Light (Gfops) Speed in practice (Gfops) Intel Haswell INT 256 2 1 4 3.5 28 23.3 AVX2 FP 256 2 8 3.5 56 49.2 Intel Sandybridge INT AVX1 FP AMD Bulldozer INT FMA4 FP Intel Nehalem INT SSE4 FP AMD K10 INT SSE4a FP Speed of light: CPU freq × ( # daxpy / Cycle) ×2

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 14 / 51

slide-32
SLIDE 32

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Register size # Adders # Multipliers # FMA # daxpy /Cycle CPU Freq. (Ghz) Speed of Light (Gfops) Speed in practice (Gfops) Intel Haswell INT 256 2 1 4 3.5 28 23.3 AVX2 FP 256 2 8 3.5 56 49.2 Intel Sandybridge INT 128 2 1 2 3.3 13.2 AVX1 FP 256 1 1 4 3.3 26.4 AMD Bulldozer INT FMA4 FP Intel Nehalem INT SSE4 FP AMD K10 INT SSE4a FP Speed of light: CPU freq × ( # daxpy / Cycle) ×2

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 14 / 51

slide-33
SLIDE 33

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Register size # Adders # Multipliers # FMA # daxpy /Cycle CPU Freq. (Ghz) Speed of Light (Gfops) Speed in practice (Gfops) Intel Haswell INT 256 2 1 4 3.5 28 23.3 AVX2 FP 256 2 8 3.5 56 49.2 Intel Sandybridge INT 128 2 1 2 3.3 13.2 12.1 AVX1 FP 256 1 1 4 3.3 26.4 24.6 AMD Bulldozer INT FMA4 FP Intel Nehalem INT SSE4 FP AMD K10 INT SSE4a FP Speed of light: CPU freq × ( # daxpy / Cycle) ×2

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 14 / 51

slide-34
SLIDE 34

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Register size # Adders # Multipliers # FMA # daxpy /Cycle CPU Freq. (Ghz) Speed of Light (Gfops) Speed in practice (Gfops) Intel Haswell INT 256 2 1 4 3.5 28 23.3 AVX2 FP 256 2 8 3.5 56 49.2 Intel Sandybridge INT 128 2 1 2 3.3 13.2 12.1 AVX1 FP 256 1 1 4 3.3 26.4 24.6 AMD Bulldozer INT 128 2 1 2 2.1 8.4 FMA4 FP 128 2 4 2.1 16.8 Intel Nehalem INT SSE4 FP AMD K10 INT SSE4a FP Speed of light: CPU freq × ( # daxpy / Cycle) ×2

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 14 / 51

slide-35
SLIDE 35

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Register size # Adders # Multipliers # FMA # daxpy /Cycle CPU Freq. (Ghz) Speed of Light (Gfops) Speed in practice (Gfops) Intel Haswell INT 256 2 1 4 3.5 28 23.3 AVX2 FP 256 2 8 3.5 56 49.2 Intel Sandybridge INT 128 2 1 2 3.3 13.2 12.1 AVX1 FP 256 1 1 4 3.3 26.4 24.6 AMD Bulldozer INT 128 2 1 2 2.1 8.4 6.44 FMA4 FP 128 2 4 2.1 16.8 13.1 Intel Nehalem INT SSE4 FP AMD K10 INT SSE4a FP Speed of light: CPU freq × ( # daxpy / Cycle) ×2

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 14 / 51

slide-36
SLIDE 36

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Register size # Adders # Multipliers # FMA # daxpy /Cycle CPU Freq. (Ghz) Speed of Light (Gfops) Speed in practice (Gfops) Intel Haswell INT 256 2 1 4 3.5 28 23.3 AVX2 FP 256 2 8 3.5 56 49.2 Intel Sandybridge INT 128 2 1 2 3.3 13.2 12.1 AVX1 FP 256 1 1 4 3.3 26.4 24.6 AMD Bulldozer INT 128 2 1 2 2.1 8.4 6.44 FMA4 FP 128 2 4 2.1 16.8 13.1 Intel Nehalem INT 128 2 1 2 2.66 10.6 SSE4 FP 128 1 1 2 2.66 10.6 AMD K10 INT SSE4a FP Speed of light: CPU freq × ( # daxpy / Cycle) ×2

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 14 / 51

slide-37
SLIDE 37

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Register size # Adders # Multipliers # FMA # daxpy /Cycle CPU Freq. (Ghz) Speed of Light (Gfops) Speed in practice (Gfops) Intel Haswell INT 256 2 1 4 3.5 28 23.3 AVX2 FP 256 2 8 3.5 56 49.2 Intel Sandybridge INT 128 2 1 2 3.3 13.2 12.1 AVX1 FP 256 1 1 4 3.3 26.4 24.6 AMD Bulldozer INT 128 2 1 2 2.1 8.4 6.44 FMA4 FP 128 2 4 2.1 16.8 13.1 Intel Nehalem INT 128 2 1 2 2.66 10.6 4.47 SSE4 FP 128 1 1 2 2.66 10.6 9.6 AMD K10 INT SSE4a FP Speed of light: CPU freq × ( # daxpy / Cycle) ×2

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 14 / 51

slide-38
SLIDE 38

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Register size # Adders # Multipliers # FMA # daxpy /Cycle CPU Freq. (Ghz) Speed of Light (Gfops) Speed in practice (Gfops) Intel Haswell INT 256 2 1 4 3.5 28 23.3 AVX2 FP 256 2 8 3.5 56 49.2 Intel Sandybridge INT 128 2 1 2 3.3 13.2 12.1 AVX1 FP 256 1 1 4 3.3 26.4 24.6 AMD Bulldozer INT 128 2 1 2 2.1 8.4 6.44 FMA4 FP 128 2 4 2.1 16.8 13.1 Intel Nehalem INT 128 2 1 2 2.66 10.6 4.47 SSE4 FP 128 1 1 2 2.66 10.6 9.6 AMD K10 INT 64 2 1 1 2.4 4.8 SSE4a FP 128 1 1 2 2.4 9.6 Speed of light: CPU freq × ( # daxpy / Cycle) ×2

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 14 / 51

slide-39
SLIDE 39

Choosing the underlying arithmetic Using machine word arithmetic

Summary: 64 bits AXPY throughput

Register size # Adders # Multipliers # FMA # daxpy /Cycle CPU Freq. (Ghz) Speed of Light (Gfops) Speed in practice (Gfops) Intel Haswell INT 256 2 1 4 3.5 28 23.3 AVX2 FP 256 2 8 3.5 56 49.2 Intel Sandybridge INT 128 2 1 2 3.3 13.2 12.1 AVX1 FP 256 1 1 4 3.3 26.4 24.6 AMD Bulldozer INT 128 2 1 2 2.1 8.4 6.44 FMA4 FP 128 2 4 2.1 16.8 13.1 Intel Nehalem INT 128 2 1 2 2.66 10.6 4.47 SSE4 FP 128 1 1 2 2.66 10.6 9.6 AMD K10 INT 64 2 1 1 2.4 4.8 SSE4a FP 128 1 1 2 2.4 9.6 8.93 Speed of light: CPU freq × ( # daxpy / Cycle) ×2

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 14 / 51

slide-40
SLIDE 40

Choosing the underlying arithmetic Using machine word arithmetic

Computing over fixed size integers: ressources

Micro-architecture bible: Agner Fog’s software optimization resources [www.agner.org/optimize] Experiments: dgemm (double): OpenBLAS [http://www.openblas.net/] igemm (int64 t): Eigen [http://eigen.tuxfamily.org/] & FFLAS-FFPACK [linalg.org/projects/fflas-ffpack]

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 15 / 51

slide-41
SLIDE 41

Choosing the underlying arithmetic Using machine word arithmetic

Looking into the near future

Intel Skylake & Knights Landing: AVX512-F 2016 (2017 on Xeons)

◮ Enlarge SIMD register width to 512 bits (8 double or int64 t) ◮ same micro arch : FMA for FP and seprate mul/add for INT.

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 16 / 51

slide-42
SLIDE 42

Choosing the underlying arithmetic Using machine word arithmetic

Looking into the near future

Intel Skylake & Knights Landing: AVX512-F 2016 (2017 on Xeons)

◮ Enlarge SIMD register width to 512 bits (8 double or int64 t) ◮ same micro arch : FMA for FP and seprate mul/add for INT.

Cannonlake: AVX512-IFMA >2017

◮ AVX512 extension: IFMA (Integer FMA): y += a*b on int64 t ◮ But limited to the lower 52 bits of the output (uses the FP FMA)

no advantage for int64 t over double

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 16 / 51

slide-43
SLIDE 43

Choosing the underlying arithmetic Using machine word arithmetic

Integer Packing

32 bits: half the precision twice the speed

double

float float float float float float float float

double double double

Gfops double float int64 t int32 t Intel SandyBridge 24.7 49.1 12.1 24.7 Intel Haswell 49.2 77.6 23.3 27.4 AMD Bulldozer 13.0 20.7 6.63 11.8

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 17 / 51

slide-44
SLIDE 44

Choosing the underlying arithmetic Using machine word arithmetic

Computing over fixed size integers

5 10 15 20 25 30 35 40 45 50 5 10 15 20 25 30 35 Speed (Gfops) Modulo p (bitsize) fgemm C = A x B n = 2000 float mod p double mod p int64 mod p

SandyBridge i5-3320M@3.3Ghz. n = 2000. Take home message

◮ Floating pt. arith. delivers the highest speed (except in corner cases) ◮ 32 bits twice as fast as 64 bits

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 18 / 51

slide-45
SLIDE 45

Choosing the underlying arithmetic Using machine word arithmetic

Computing over fixed size integers

5 10 15 20 25 30 35 40 45 50 5 10 15 20 25 30 35 Speed (Gfops) Modulo p (bitsize) fgemm C = A x B n = 2000 float mod p double mod p int64 mod p 100 200 300 400 500 600 5 10 15 20 25 30 35 Bit speed (Gbops) Modulo p (bitsize) fgemm C = A x B n = 2000 float mod p double mod p int64 mod p

SandyBridge i5-3320M@3.3Ghz. n = 2000. Take home message

◮ Floating pt. arith. delivers the highest speed (except in corner cases) ◮ 32 bits twice as fast as 64 bits ◮ best bit computation throughput for double precision floating points.

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 18 / 51

slide-46
SLIDE 46

Choosing the underlying arithmetic Larger field sizes

Larger finite fields: log2 p ≥ 32

As before:

1 Use adequate integer arithmetic 2 reduce modulo p only when necessary

Which integer arithmetic?

1 big integers (GMP) 2 fixed size multiprecision (Givaro-RecInt) 3 Residue Number Systems (Chinese Remainder theorem)

using moduli delivering optimum bitspeed

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 19 / 51

slide-47
SLIDE 47

Choosing the underlying arithmetic Larger field sizes

Larger finite fields: log2 p ≥ 32

As before:

1 Use adequate integer arithmetic 2 reduce modulo p only when necessary

Which integer arithmetic?

1 big integers (GMP) 2 fixed size multiprecision (Givaro-RecInt) 3 Residue Number Systems (Chinese Remainder theorem)

using moduli delivering optimum bitspeed log2 p 50 100 150 GMP 58.1s 94.6s 140s RecInt 5.7s 28.6s 837s RNS 0.785s 1.42s 1.88s n = 1000, on an Intel SandyBridge.

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 19 / 51

slide-48
SLIDE 48

Choosing the underlying arithmetic Larger field sizes

In practice

5 10 15 20 25 30 35 40 45 50 5 10 15 20 25 30 35 Speed (Gfops) Modulo p (bitsize) fgemm C = A x B n = 2000 float mod p double mod p int64 mod p RNS mod p 100 200 300 400 500 600 5 10 15 20 25 30 35 Bit speed (Gbops) Modulo p (bitsize) fgemm C = A x B n = 2000 float mod p double mod p int64 mod p RNS mod p

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 20 / 51

slide-49
SLIDE 49

Choosing the underlying arithmetic Larger field sizes

In practice

5 10 15 20 25 30 35 40 45 50 5 10 15 20 25 30 35 Speed (Gfops) Modulo p (bitsize) fgemm C = A x B n = 2000 float mod p double mod p int64 mod p RNS mod p 100 200 300 400 500 600 5 10 15 20 25 30 35 Bit speed (Gbops) Modulo p (bitsize) fgemm C = A x B n = 2000 float mod p double mod p int64 mod p RNS mod p 100 200 300 400 500 600 20 40 60 80 100 120 140 Bit speed (Gbops) Modulo p (bitsize) fgemm C = A x B n = 2000 float mod p double mod p int64 mod p RNS mod p

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 20 / 51

slide-50
SLIDE 50

Choosing the underlying arithmetic Larger field sizes

In practice

5 10 15 20 25 30 35 40 45 50 5 10 15 20 25 30 35 Speed (Gfops) Modulo p (bitsize) fgemm C = A x B n = 2000 float mod p double mod p int64 mod p RNS mod p 100 200 300 400 500 600 5 10 15 20 25 30 35 Bit speed (Gbops) Modulo p (bitsize) fgemm C = A x B n = 2000 float mod p double mod p int64 mod p RNS mod p 100 200 300 400 500 600 20 40 60 80 100 120 140 Bit speed (Gbops) Modulo p (bitsize) fgemm C = A x B n = 2000 float mod p double mod p int64 mod p RNS mod p RecInt mod p

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 20 / 51

slide-51
SLIDE 51

Reductions and building blocks

Outline

1

Choosing the underlying arithmetic Using machine word arithmetic Larger field sizes

2

Reductions and building blocks

3

Gaussian elimination Which reduction Computing rank profiles Algorithmic instances Relation to other decompositions The small rank case

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 21 / 51

slide-52
SLIDE 52

Reductions and building blocks

Reductions to building blocks

Huge number of algorithmic variants for a given computation. Need to structure the design for a set of routines :

◮ Focus tuning effort on a single routine ◮ Some operations deliver better efficiency: ⊲ in practice: memory access pattern ⊲ in theory: better algorithms

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 22 / 51

slide-53
SLIDE 53

Reductions and building blocks

Memory access pattern

◮ The memory wall: communication speed

improves slower than arithmetic

2 4 6 8 10 12 14 16 18 20 22 1980 1985 1990 1995 2000 2005 2010 2015 Speed Time CPU arithmetic throughput Memory speed

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 23 / 51

slide-54
SLIDE 54

Reductions and building blocks

Memory access pattern

◮ The memory wall: communication speed

improves slower than arithmetic

◮ Deep memory hierarchy

2 4 6 8 10 12 14 16 18 20 22 1980 1985 1990 1995 2000 2005 2010 2015 Speed Time CPU arithmetic throughput Memory speed CPU L1 RAM L2 L3

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 23 / 51

slide-55
SLIDE 55

Reductions and building blocks

Memory access pattern

◮ The memory wall: communication speed

improves slower than arithmetic

◮ Deep memory hierarchy

Need to overlap communications by computation Design of BLAS 3 [Dongarra & Al. 87]

◮ Group all ops in Matrix products gemm:

Work O(n3) >> Data O(n2) MatMul has become a building block in practice

2 4 6 8 10 12 14 16 18 20 22 1980 1985 1990 1995 2000 2005 2010 2015 Speed Time CPU arithmetic throughput Memory speed CPU L1 RAM L2 L3

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 23 / 51

slide-56
SLIDE 56

Reductions and building blocks

Sub-cubic linear algebra

< 1969: O(n3) for everyone (Gauss, Householder, Danilevsk˘ ıi, etc)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 24 / 51

slide-57
SLIDE 57

Reductions and building blocks

Sub-cubic linear algebra

< 1969: O(n3) for everyone (Gauss, Householder, Danilevsk˘ ıi, etc)

Matrix Multiplication O(nω)

[Strassen 69]: O(n2.807) . . . [Sch¨

  • nhage 81]

O(n2.52) . . . [Coppersmith, Winograd 90] O(n2.375) . . . [Le Gall 14] O(n2.3728639)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 24 / 51

slide-58
SLIDE 58

Reductions and building blocks

Sub-cubic linear algebra

< 1969: O(n3) for everyone (Gauss, Householder, Danilevsk˘ ıi, etc)

Matrix Multiplication O(nω)

[Strassen 69]: O(n2.807) . . . [Sch¨

  • nhage 81]

O(n2.52) . . . [Coppersmith, Winograd 90] O(n2.375) . . . [Le Gall 14] O(n2.3728639)

Other operations

[Strassen 69]: Inverse in O(nω) [Sch¨

  • nhage 72]:

QR in O(nω) [Bunch, Hopcroft 74]: LU in O(nω) [Ibarra & al. 82]: Rank in O(nω) [Keller-Gehrig 85]: CharPoly in O(nω log n)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 24 / 51

slide-59
SLIDE 59

Reductions and building blocks

Sub-cubic linear algebra

< 1969: O(n3) for everyone (Gauss, Householder, Danilevsk˘ ıi, etc)

Matrix Multiplication O(nω)

[Strassen 69]: O(n2.807) . . . [Sch¨

  • nhage 81]

O(n2.52) . . . [Coppersmith, Winograd 90] O(n2.375) . . . [Le Gall 14] O(n2.3728639)

Other operations

[Strassen 69]: Inverse in O(nω) [Sch¨

  • nhage 72]:

QR in O(nω) [Bunch, Hopcroft 74]: LU in O(nω) [Ibarra & al. 82]: Rank in O(nω) [Keller-Gehrig 85]: CharPoly in O(nω log n)

MatMul has become a building block in theoretical reductions

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 24 / 51

slide-60
SLIDE 60

Reductions and building blocks

Reductions: theory

HNF(Z) Det(Z) LinSys(Z) MM(Z) SNF(Z) Det(Zp) LU(Zp) CharPoly(Zp) MinPoly(Zp) TRSM(Zp) MM(Zp) LinSys(Zp)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 25 / 51

slide-61
SLIDE 61

Reductions and building blocks

Reductions: theory

HNF(Z) Det(Z) LinSys(Z) MM(Z) SNF(Z) Det(Zp) LU(Zp) CharPoly(Zp) MinPoly(Zp) TRSM(Zp) MM(Zp) LinSys(Zp)

Common mistrust Fast linear algebra is ✗ never faster ✗ numerically unstable

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 25 / 51

slide-62
SLIDE 62

Reductions and building blocks

Reductions: theory and practice

HNF(Z) Det(Z) LinSys(Z) MM(Z) SNF(Z) Det(Zp) LU(Zp) CharPoly(Zp) MinPoly(Zp) TRSM(Zp) MM(Zp) LinSys(Zp)

Common mistrust Fast linear algebra is ✗ never faster ✗ numerically unstable Lucky coincidence ✓ same building block in theory and in practice reduction trees are still relevant

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 25 / 51

slide-63
SLIDE 63

Reductions and building blocks

Reductions: theory and practice

HNF(Z) Det(Z) LinSys(Z) MM(Z) SNF(Z) Det(Zp) LU(Zp) CharPoly(Zp) MinPoly(Zp) TRSM(Zp) MM(Zp) LinSys(Zp)

Common mistrust Fast linear algebra is ✗ never faster ✗ numerically unstable Lucky coincidence ✓ same building block in theory and in practice reduction trees are still relevant Road map towards efficiency in practice

1

Tune the MatMul building block.

2

Tune the reductions.

3

New reductions.

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 25 / 51

slide-64
SLIDE 64

Reductions and building blocks

Putting it together: MatMul building block over Z/pZ

Ingedients [FFLAS-FFPACK library]

◮ Compute over Z and delay modular reductions

k

  • p−1

2

2 < 2mantissa

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 26 / 51

slide-65
SLIDE 65

Reductions and building blocks

Putting it together: MatMul building block over Z/pZ

Ingedients [FFLAS-FFPACK library]

◮ Compute over Z and delay modular reductions

k

  • p−1

2

2 < 253

◮ Fastest integer arithmetic: double ◮ Cache optimizations

numerical BLAS

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 26 / 51

slide-66
SLIDE 66

Reductions and building blocks

Putting it together: MatMul building block over Z/pZ

Ingedients [FFLAS-FFPACK library]

◮ Compute over Z and delay modular reductions

9ℓ k

2ℓ p−1 2

2 < 253

◮ Fastest integer arithmetic: double ◮ Cache optimizations

numerical BLAS

◮ Strassen-Winograd 6n2.807 + . . .

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 26 / 51

slide-67
SLIDE 67

Reductions and building blocks

Putting it together: MatMul building block over Z/pZ

Ingedients [FFLAS-FFPACK library]

◮ Compute over Z and delay modular reductions

9ℓ k

2ℓ p−1 2

2 < 253

◮ Fastest integer arithmetic: double ◮ Cache optimizations

numerical BLAS

◮ Strassen-Winograd 6n2.807 + . . .

with memory efficient schedules [Boyer, Dumas, P. and Zhou 09] Tradeoffs: Extra memory Overwriting input Leading constant Fully in-place in 7.2n2.807 + . . .

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 26 / 51

slide-68
SLIDE 68

Reductions and building blocks

Sequential Matrix Multiplication

10 20 30 40 50 60 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 2n3/time/109 (Gfops equiv.) matrix dimension i5−3320M at 2.6Ghz with AVX 1 OpenBLAS sgemm

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 27 / 51

slide-69
SLIDE 69

Reductions and building blocks

Sequential Matrix Multiplication

10 20 30 40 50 60 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 2n3/time/109 (Gfops equiv.) matrix dimension i5−3320M at 2.6Ghz with AVX 1 FFLAS fgemm over Z/83Z OpenBLAS sgemm

p = 83, 1 mod / 10000 mul.

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 27 / 51

slide-70
SLIDE 70

Reductions and building blocks

Sequential Matrix Multiplication

10 20 30 40 50 60 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 2n3/time/109 (Gfops equiv.) matrix dimension i5−3320M at 2.6Ghz with AVX 1 FFLAS fgemm over Z/83Z FFLAS fgemm over Z/821Z OpenBLAS sgemm

p = 83, 1 mod / 10000 mul. p = 821, 1 mod / 100 mul.

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 27 / 51

slide-71
SLIDE 71

Reductions and building blocks

Sequential Matrix Multiplication

10 20 30 40 50 60 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 2n3/time/109 (Gfops equiv.) matrix dimension i5−3320M at 2.6Ghz with AVX 1 FFLAS fgemm over Z/83Z FFLAS fgemm over Z/821Z OpenBLAS sgemm FFLAS fgemm over Z/1898131Z FFLAS fgemm over Z/18981307Z OpenBLAS dgemm

p = 83, 1 mod / 10000 mul. p = 821, 1 mod / 100 mul. p = 1898131, 1 mod / 10000 mul. p = 18981307, 1 mod / 100 mul.

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 27 / 51

slide-72
SLIDE 72

Reductions and building blocks

Reductions in dense linear algebra

LU decomposition

◮ Block recursive algorithm reduces to MatMul O(nω)

n 1000 5000 10000 15000 20000 LAPACK-dgetrf 0.024s 2.01s 14.88s 48.78s 113.66 fflas-ffpack 0.058s 2.46s 16.08s 47.47s 105.96s Intel Haswell E3-1270 3.0Ghz using OpenBLAS-0.2.9

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 28 / 51

slide-73
SLIDE 73

Reductions and building blocks

Reductions in dense linear algebra

LU decomposition

◮ Block recursive algorithm reduces to MatMul O(nω)

n 1000 5000 10000 15000 20000 LAPACK-dgetrf 0.024s 2.01s 14.88s 48.78s 113.66 fflas-ffpack 0.058s 2.46s 16.08s 47.47s 105.96s Intel Haswell E3-1270 3.0Ghz using OpenBLAS-0.2.9

Characteristic Polynomial

◮ A new reduction to matrix multiplication in O(nω).

n 1000 2000 5000 10000 magma-v2.19-9 1.38s 24.28s 332.7s 2497s fflas-ffpack 0.532s 2.936s 32.71s 219.2s Intel Ivy-Bridge i5-3320 2.6Ghz using OpenBLAS-0.2.9

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 28 / 51

slide-74
SLIDE 74

Reductions and building blocks

Reductions in dense linear algebra

LU decomposition

◮ Block recursive algorithm reduces to MatMul O(nω)

n 1000 5000 10000 15000 20000 LAPACK-dgetrf 0.024s 2.01s 14.88s 48.78s 113.66 fflas-ffpack 0.058s 2.46s 16.08s 47.47s 105.96s Intel Haswell E3-1270 3.0Ghz using OpenBLAS-0.2.9

×7.63 ×6.59 Characteristic Polynomial

◮ A new reduction to matrix multiplication in O(nω).

n 1000 2000 5000 10000 magma-v2.19-9 1.38s 24.28s 332.7s 2497s fflas-ffpack 0.532s 2.936s 32.71s 219.2s Intel Ivy-Bridge i5-3320 2.6Ghz using OpenBLAS-0.2.9

×7.5 ×6.7

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 28 / 51

slide-75
SLIDE 75

Gaussian elimination

Outline

1

Choosing the underlying arithmetic Using machine word arithmetic Larger field sizes

2

Reductions and building blocks

3

Gaussian elimination Which reduction Computing rank profiles Algorithmic instances Relation to other decompositions The small rank case

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 29 / 51

slide-76
SLIDE 76

Gaussian elimination Which reduction

The case of Gaussian elimination

Which reduction to MatMul ?

Slab iterative Slab recursive

LAPACK FFLAS-FFPACK

Tile iterative Tile recursive

PLASMA FFLAS-FFPACK

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 30 / 51

slide-77
SLIDE 77

Gaussian elimination Which reduction

The case of Gaussian elimination

Which reduction to MatMul ?

Slab recursive

FFLAS-FFPACK

Tile recursive

FFLAS-FFPACK

◮ Sub-cubic complexity: recursive algorithms

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 30 / 51

slide-78
SLIDE 78

Gaussian elimination Which reduction

The case of Gaussian elimination

Which reduction to MatMul ?

Tile recursive

FFLAS-FFPACK

◮ Sub-cubic complexity: recursive algorithms ◮ Data locality

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 30 / 51

slide-79
SLIDE 79

Gaussian elimination Computing rank profiles

Computing rank profiles

Rank profiles: first linearly independent columns

◮ Major invariant of a matrix (echelon form) ◮ Gr¨

  • bner basis computations (Macaulay matrix)

◮ Krylov methods

Gaussian elimination revealing echelon forms: [Ibarra, Moran and Hui 82]

S = A L P

[Keller-Gehrig 85]

= A R X

[Jeannerod, P. and Storjohann 13]

A E = P L

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 31 / 51

slide-80
SLIDE 80

Gaussian elimination Computing rank profiles

Computing rank profiles

Lessons learned (or what we thought was necessary):

◮ treat rows in order ◮ exhaust all columns before considering the next row ◮ slab block splitting required (recursive or iterative)

similar to partial pivoting Need for a more flexible pivoting

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 32 / 51

slide-81
SLIDE 81

Gaussian elimination Computing rank profiles

Computing rank profiles

Lessons learned (or what we thought was necessary):

◮ treat rows in order ◮ exhaust all columns before considering the next row ◮ slab block splitting required (recursive or iterative)

similar to partial pivoting Need for a more flexible pivoting Gathering linear independence invariants Two ways to look at a matrix: row- or column-wise

◮ Row rank profile, column echelon form, ◮ Column rank profile, row echelon form,

Can’t a unique invariant capture all information ?

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 32 / 51

slide-82
SLIDE 82

Gaussian elimination Computing rank profiles

The rank profile matrix [Dumas, P. and Sultan’15]

Definition (Rank Profile matrix) The unique RA ∈ {0, 1}m×n such that any pair of (i, j)-leading sub-matrix of RA and of A have the same rank.

R A

1 2 3 4 2 4 5 8 1 2 3 4 3 5 9 12 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 A R

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 33 / 51

slide-83
SLIDE 83

Gaussian elimination Computing rank profiles

The rank profile matrix [Dumas, P. and Sultan’15]

Definition (Rank Profile matrix) The unique RA ∈ {0, 1}m×n such that any pair of (i, j)-leading sub-matrix of RA and of A have the same rank.

R A

Theorem

◮ RowRP and ColRP read directly on R(A) ◮ Same holds for any (i, j)-leading submatrix.

1 2 3 4 2 4 5 8 1 2 3 4 3 5 9 12 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 A R

RowRP = {1} ColRP = {1}

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 33 / 51

slide-84
SLIDE 84

Gaussian elimination Computing rank profiles

The rank profile matrix [Dumas, P. and Sultan’15]

Definition (Rank Profile matrix) The unique RA ∈ {0, 1}m×n such that any pair of (i, j)-leading sub-matrix of RA and of A have the same rank.

R A

Theorem

◮ RowRP and ColRP read directly on R(A) ◮ Same holds for any (i, j)-leading submatrix.

1 2 3 4 2 4 5 8 1 2 3 4 3 5 9 12 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 A R

RowRP = {1,2} ColRP = {1,3}

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 33 / 51

slide-85
SLIDE 85

Gaussian elimination Computing rank profiles

The rank profile matrix [Dumas, P. and Sultan’15]

Definition (Rank Profile matrix) The unique RA ∈ {0, 1}m×n such that any pair of (i, j)-leading sub-matrix of RA and of A have the same rank.

R A

Theorem

◮ RowRP and ColRP read directly on R(A) ◮ Same holds for any (i, j)-leading submatrix.

1 2 3 4 2 4 5 8 1 2 3 4 3 5 9 12 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 A R

RowRP = {1,4} ColRP = {1,2}

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 33 / 51

slide-86
SLIDE 86

Gaussian elimination Computing rank profiles

The rank profile matrix [Dumas, P. and Sultan’15]

Definition (Rank Profile matrix) The unique RA ∈ {0, 1}m×n such that any pair of (i, j)-leading sub-matrix of RA and of A have the same rank.

R A

Theorem

◮ RowRP and ColRP read directly on R(A) ◮ Same holds for any (i, j)-leading submatrix.

1 2 3 4 2 4 5 8 1 2 3 4 3 5 9 12 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 A R

RowRP = {1,4} ColRP = {1,2} A = PLUQ = P L M Im−r

  • Ir
  • U

V In−r

  • Q
  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 33 / 51

slide-87
SLIDE 87

Gaussian elimination Computing rank profiles

The rank profile matrix [Dumas, P. and Sultan’15]

Definition (Rank Profile matrix) The unique RA ∈ {0, 1}m×n such that any pair of (i, j)-leading sub-matrix of RA and of A have the same rank.

R A

Theorem

◮ RowRP and ColRP read directly on R(A) ◮ Same holds for any (i, j)-leading submatrix.

1 2 3 4 2 4 5 8 1 2 3 4 3 5 9 12 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 A R

RowRP = {1,4} ColRP = {1,2} A = PLUQ = P L M Im−r

  • P T P

Ir

  • QQT

U V In−r

  • Q
  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 33 / 51

slide-88
SLIDE 88

Gaussian elimination Computing rank profiles

The rank profile matrix [Dumas, P. and Sultan’15]

Definition (Rank Profile matrix) The unique RA ∈ {0, 1}m×n such that any pair of (i, j)-leading sub-matrix of RA and of A have the same rank.

R A

Theorem

◮ RowRP and ColRP read directly on R(A) ◮ Same holds for any (i, j)-leading submatrix.

1 2 3 4 2 4 5 8 1 2 3 4 3 5 9 12 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 A R

RowRP = {1,4} ColRP = {1,2} A = PLUQ = P L M Im−r

  • P T
  • L

P Ir

  • Q
  • ΠP,Q

QT U V In−r

  • Q
  • U
  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 33 / 51

slide-89
SLIDE 89

Gaussian elimination Computing rank profiles

The rank profile matrix [Dumas, P. and Sultan’15]

Definition (Rank Profile matrix) The unique RA ∈ {0, 1}m×n such that any pair of (i, j)-leading sub-matrix of RA and of A have the same rank.

R A

Theorem

◮ RowRP and ColRP read directly on R(A) ◮ Same holds for any (i, j)-leading submatrix.

1 2 3 4 2 4 5 8 1 2 3 4 3 5 9 12 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 A R

RowRP = {1,4} ColRP = {1,2} A = PLUQ = P L M Im−r

  • P T
  • L

P Ir

  • Q
  • ΠP,Q

QT U V In−r

  • Q
  • U

When does ΠP,Q = R(A) ?

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 33 / 51

slide-90
SLIDE 90

Gaussian elimination Computing rank profiles

Anatomy of a PLUQ decomposition

Four types of elementary operations: Search: finding a pivot

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 34 / 51

slide-91
SLIDE 91

Gaussian elimination Computing rank profiles

Anatomy of a PLUQ decomposition

Four types of elementary operations: Search: finding a pivot Permutation: moving the pivot to the main diagonal

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 34 / 51

slide-92
SLIDE 92

Gaussian elimination Computing rank profiles

Anatomy of a PLUQ decomposition

Four types of elementary operations: Search: finding a pivot Permutation: moving the pivot to the main diagonal Normalization: computing L: li,k ← ai,k

ak,k

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 34 / 51

slide-93
SLIDE 93

Gaussian elimination Computing rank profiles

Anatomy of a PLUQ decomposition

Four types of elementary operations: Search: finding a pivot Permutation: moving the pivot to the main diagonal Normalization: computing L: li,k ← ai,k

ak,k

Update: applying the elimination ai,j ← ai,j − ai,kak,j

ak,k

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 34 / 51

slide-94
SLIDE 94

Gaussian elimination Computing rank profiles

Impact on the PLUQ decomposition

Normalization: determines whether L or U is unit diagonal

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 35 / 51

slide-95
SLIDE 95

Gaussian elimination Computing rank profiles

Impact on the PLUQ decomposition

Normalization: determines whether L or U is unit diagonal Update: no impact on the decomposition, only in the scheduling:

◮ iterative, tile/slab iterative, recursive, ◮ left/right looking, Crout

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 35 / 51

slide-96
SLIDE 96

Gaussian elimination Computing rank profiles

Impact on the PLUQ decomposition

Normalization: determines whether L or U is unit diagonal Update: no impact on the decomposition, only in the scheduling:

◮ iterative, tile/slab iterative, recursive, ◮ left/right looking, Crout

Search: defines the first r values of P and Q

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 35 / 51

slide-97
SLIDE 97

Gaussian elimination Computing rank profiles

Impact on the PLUQ decomposition

Normalization: determines whether L or U is unit diagonal Update: no impact on the decomposition, only in the scheduling:

◮ iterative, tile/slab iterative, recursive, ◮ left/right looking, Crout

Search: defines the first r values of P and Q Permutation: impacts all values of P and Q

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 35 / 51

slide-98
SLIDE 98

Gaussian elimination Computing rank profiles

Impact on the PLUQ decomposition

Normalization: determines whether L or U is unit diagonal Update: no impact on the decomposition, only in the scheduling:

◮ iterative, tile/slab iterative, recursive, ◮ left/right looking, Crout

Search: defines the first r values of P and Q Permutation: impacts all values of P and Q Problem (Reformulation) Under what conditions on the Search and Permutation operations does a PLUQ decomposition algorithm reveals RowRP, ColRP or RA?

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 35 / 51

slide-99
SLIDE 99

Gaussian elimination Computing rank profiles

The Pivoting matrix

Definition (The pivoting matrix) Given a PLUQ decomposition A = PLUQ with rank r, define ΠP,Q = P Ir

  • Q.

Locates the position of the pivots in the matrix A.

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 36 / 51

slide-100
SLIDE 100

Gaussian elimination Computing rank profiles

The Pivoting matrix

Definition (The pivoting matrix) Given a PLUQ decomposition A = PLUQ with rank r, define ΠP,Q = P Ir

  • Q.

Locates the position of the pivots in the matrix A. Problem (Rank profile revealing PLUQ decompositions) Under which conditions

◮ ΠP,Q = RA

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 36 / 51

slide-101
SLIDE 101

Gaussian elimination Computing rank profiles

The Pivoting matrix

Definition (The pivoting matrix) Given a PLUQ decomposition A = PLUQ with rank r, define ΠP,Q = P Ir

  • Q.

Locates the position of the pivots in the matrix A. Problem (Rank profile revealing PLUQ decompositions) Under which conditions

◮ ΠP,Q = RA ◮ RowSupp(ΠP,Q) = RowSupp(RA) = RowRP(A) (Weaker) ◮ ColSupp(ΠP,Q) = ColSupp(RA) = ColRP(A) (Weaker)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 36 / 51

slide-102
SLIDE 102

Gaussian elimination Computing rank profiles

The Search operation

Various strategies depending on the context Numerical stability: find the absolute largest pivot Data locality: find pivot not too far from the main diagonal Sparsity: find pivot that minimizes/reduce fill-in

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 37 / 51

slide-103
SLIDE 103

Gaussian elimination Computing rank profiles

The Search operation

Various strategies depending on the context Numerical stability: find the absolute largest pivot Data locality: find pivot not too far from the main diagonal Sparsity: find pivot that minimizes/reduce fill-in Search revealing rank profiles

◮ No stability issue over exact domains ◮ Intuition: must minimize some ordering of the row/col indices

(notion of rank profile)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 37 / 51

slide-104
SLIDE 104

Gaussian elimination Computing rank profiles

The Search operation

Various strategies depending on the context Numerical stability: find the absolute largest pivot Data locality: find pivot not too far from the main diagonal Sparsity: find pivot that minimizes/reduce fill-in Search revealing rank profiles

◮ No stability issue over exact domains ◮ Intuition: must minimize some ordering of the row/col indices

(notion of rank profile) Example Search: “Any non zero element on the topmost row”:

A = 2 0 3 0

1 0 0 0 0 0 4 0 0 2 0 1

  • ⇒ RA =

1 0 0 0

0 0 1 0 0 0 0 0 0 1 0 0

  • ,
  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 37 / 51

slide-105
SLIDE 105

Gaussian elimination Computing rank profiles

The Search operation

Various strategies depending on the context Numerical stability: find the absolute largest pivot Data locality: find pivot not too far from the main diagonal Sparsity: find pivot that minimizes/reduce fill-in Search revealing rank profiles

◮ No stability issue over exact domains ◮ Intuition: must minimize some ordering of the row/col indices

(notion of rank profile) Example Search: “Any non zero element on the topmost row”:

A = 2 0 3 0

1 0 0 0 0 0 4 0 0 2 0 1

  • ⇒ RA =

1 0 0 0

0 0 1 0 0 0 0 0 0 1 0 0

  • , ΠP,Q =

0 0 1 0

1 0 0 0 0 0 0 0 0 1 0 0

  • .
  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 37 / 51

slide-106
SLIDE 106

Gaussian elimination Computing rank profiles

The Search operation

Various strategies depending on the context Numerical stability: find the absolute largest pivot Data locality: find pivot not too far from the main diagonal Sparsity: find pivot that minimizes/reduce fill-in Search revealing rank profiles

◮ No stability issue over exact domains ◮ Intuition: must minimize some ordering of the row/col indices

(notion of rank profile) Example Search: “Any non zero element on the topmost row”:

A = 2 0 3 0

1 0 0 0 0 0 4 0 0 2 0 1

  • ⇒ RA =

1 0 0 0

0 0 1 0 0 0 0 0 0 1 0 0

  • , ΠP,Q =

0 0 1 0

1 0 0 0 0 0 0 0 0 1 0 0

  • .

RowRP={1,2,4}

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 37 / 51

slide-107
SLIDE 107

Gaussian elimination Pivoting strategies

Pivoting and permutation strategies

Pivot Search

Pivot’s (i, j) position minimizes some pre-order: Row

  • rder: any non-zero on the first non-zero row

1 1 1 1 1

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 38 / 51

slide-108
SLIDE 108

Gaussian elimination Pivoting strategies

Pivoting and permutation strategies

Pivot Search

Pivot’s (i, j) position minimizes some pre-order: Row/Col order: any non-zero on the first non-zero row/col

1 1 1 1 1

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 38 / 51

slide-109
SLIDE 109

Gaussian elimination Pivoting strategies

Pivoting and permutation strategies

Pivot Search

Pivot’s (i, j) position minimizes some pre-order: Row/Col order: any non-zero on the first non-zero row/col Lex

  • rder: first non-zero on the first non-zero row

1 1 1 1 1

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 38 / 51

slide-110
SLIDE 110

Gaussian elimination Pivoting strategies

Pivoting and permutation strategies

Pivot Search

Pivot’s (i, j) position minimizes some pre-order: Row/Col order: any non-zero on the first non-zero row/col Lex/RevLex order: first non-zero on the first non-zero row/col

1 1 1 1 1

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 38 / 51

slide-111
SLIDE 111

Gaussian elimination Pivoting strategies

Pivoting and permutation strategies

Pivot Search

Pivot’s (i, j) position minimizes some pre-order: Row/Col order: any non-zero on the first non-zero row/col Lex/RevLex order: first non-zero on the first non-zero row/col Product order: first non-zero in the (i, j) leading sub-matrix

1 1 1 1 1

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 38 / 51

slide-112
SLIDE 112

Gaussian elimination Pivoting strategies

Sufficient ?

Is lexicographic ordering sufficient to reveal both rank profiles? Example

With a lexicographic ordering

1

A =    2 0 3 0 1 0 0 0 0 0 4 0 0 2 0 1    ⇒ RA =    1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0    = ΠP,Q

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 39 / 51

slide-113
SLIDE 113

Gaussian elimination Pivoting strategies

Sufficient ?

Is lexicographic ordering sufficient to reveal both rank profiles? Example

With a lexicographic ordering

1

A =    2 0 3 0 1 0 0 0 0 0 4 0 0 2 0 1    ⇒ RA =    1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0    = ΠP,Q

2

But A =

  • 0 0 1

2 3 0

  • RA =
  • 0 0 1

1 0 0

  • and ΠP,Q =
  • 0 0 1

0 1 0

  • .
  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 39 / 51

slide-114
SLIDE 114

Gaussian elimination Pivoting strategies

Sufficient ?

Is lexicographic ordering sufficient to reveal both rank profiles? Example

With a lexicographic ordering

1

A =    2 0 3 0 1 0 0 0 0 0 4 0 0 2 0 1    ⇒ RA =    1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0    = ΠP,Q

2

But A =

  • 0 0 1

2 3 0

  • RA =
  • 0 0 1

1 0 0

  • and ΠP,Q =
  • 0 0 1

0 1 0

  • .

Pivot Swaps mix-up precedence between rows/cols. Permutations also have to be considered

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 39 / 51

slide-115
SLIDE 115

Gaussian elimination Pivoting strategies

Pivoting and permutation strategies

Pivot Search

Pivot’s (i, j) position minimizes some pre-order: Row/Col order: any non-zero on the first non-zero row/col Lex/RevLex order: first non-zero on the first non-zero row/col Product order: first non-zero in the (i, j) leading sub-matrix

Permutation

◮ Transpositions

1 1 1 1 1

1 1 1 1 1

Transposition

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 40 / 51

slide-116
SLIDE 116

Gaussian elimination Pivoting strategies

Pivoting and permutation strategies

Pivot Search

Pivot’s (i, j) position minimizes some pre-order: Row/Col order: any non-zero on the first non-zero row/col Lex/RevLex order: first non-zero on the first non-zero row/col Product order: first non-zero in the (i, j) leading sub-matrix

Permutation

◮ Transpositions ◮ Cyclic Rotations

1 1 1 1 1

1 1 1 1 1

Cyclic rotation

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 40 / 51

slide-117
SLIDE 117

Gaussian elimination Pivoting strategies

Pivoting strategies revealing rank profiles

For any type of PLUQ algorithm: iterative / block iterative / recursive

Search Row perm.

  • Col. perm.

RowRP ColRP RA Instance Row order

  • Col. order

Lexico.

  • Rev. lex.

Product

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 41 / 51

slide-118
SLIDE 118

Gaussian elimination Pivoting strategies

Pivoting strategies revealing rank profiles

For any type of PLUQ algorithm: iterative / block iterative / recursive

Search Row perm.

  • Col. perm.

RowRP ColRP RA Instance Row order Transposition Transposition ✓ [IMH82] [JPS13]

  • Col. order

Lexico.

  • Rev. lex.

Product

◮ RowRP =

1 2 . . . m P Ir

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 41 / 51

slide-119
SLIDE 119

Gaussian elimination Pivoting strategies

Pivoting strategies revealing rank profiles

For any type of PLUQ algorithm: iterative / block iterative / recursive

Search Row perm.

  • Col. perm.

RowRP ColRP RA Instance Row order Transposition Transposition ✓ [IMH82] [JPS13]

  • Col. order

Transposition Transposition ✓ [KG85] [JPS13] Lexico.

  • Rev. lex.

Product

◮ RowRP =

1 2 . . . m P Ir

  • ◮ ColRP =

Ir Q 1 2 . . . mT

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 41 / 51

slide-120
SLIDE 120

Gaussian elimination Pivoting strategies

Pivoting strategies revealing rank profiles

For any type of PLUQ algorithm: iterative / block iterative / recursive

Search Row perm.

  • Col. perm.

RowRP ColRP RA Instance Row order Transposition Transposition ✓ [IMH82] [JPS13]

  • Col. order

Transposition Transposition ✓ [KG85] [JPS13] Lexico. Transposition Transposition ✓ [Sto00]

  • Rev. lex.

Transposition Transposition ✓ [Sto00] Product

◮ RowRP =

1 2 . . . m P Ir

  • ◮ ColRP =

Ir Q 1 2 . . . mT

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 41 / 51

slide-121
SLIDE 121

Gaussian elimination Pivoting strategies

Pivoting strategies revealing rank profiles

For any type of PLUQ algorithm: iterative / block iterative / recursive

Search Row perm.

  • Col. perm.

RowRP ColRP RA Instance Row order Transposition Transposition ✓ [IMH82] [JPS13]

  • Col. order

Transposition Transposition ✓ [KG85] [JPS13] Lexico. Transposition Transposition ✓ [Sto00]

  • Rev. lex.

Transposition Transposition ✓ [Sto00] Product Rotation Rotation ✓ ✓ ✓ [DPS13]

◮ RowRP =

1 2 . . . m P Ir

  • ◮ ColRP =

Ir Q 1 2 . . . mT

◮ RA = P

Ir

  • Q
  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 41 / 51

slide-122
SLIDE 122

Gaussian elimination Pivoting strategies

Pivoting strategies revealing rank profiles

For any type of PLUQ algorithm: iterative / block iterative / recursive

Search Row perm.

  • Col. perm.

RowRP ColRP RA Instance Row order Transposition Transposition ✓ [IMH82] [JPS13]

  • Col. order

Transposition Transposition ✓ [KG85] [JPS13] Lexico. Transposition Transposition ✓ [Sto00]

  • Rev. lex.

Transposition Transposition ✓ [Sto00] Product Rotation Transposition ✓ [DPS15] Product Transposition Rotation ✓ [DPS15] Product Rotation Rotation ✓ ✓ ✓ [DPS13]

◮ RowRP =

1 2 . . . m P Ir

  • ◮ ColRP =

Ir Q 1 2 . . . mT

◮ RA = P

Ir

  • Q
  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 41 / 51

slide-123
SLIDE 123

Gaussian elimination Pivoting strategies

Pivoting strategies revealing rank profiles

For any type of PLUQ algorithm: iterative / block iterative / recursive

Search Row perm.

  • Col. perm.

RowRP ColRP RA Instance Row order Transposition Transposition ✓ [IMH82] [JPS13]

  • Col. order

Transposition Transposition ✓ [KG85] [JPS13] Lexico. Transposition Transposition ✓ [Sto00] Lexico. Transposition Rotation ✓ ✓ ✓ [DPS15] Lexico. Rotation Rotation ✓ ✓ ✓ [DPS15]

  • Rev. lex.

Transposition Transposition ✓ [Sto00]

  • Rev. lex.

Rotation Transposition ✓ ✓ ✓ [DPS15]

  • Rev. lex.

Rotation Rotation ✓ ✓ ✓ [DPS15] Product Rotation Transposition ✓ [DPS15] Product Transposition Rotation ✓ [DPS15] Product Rotation Rotation ✓ ✓ ✓ [DPS13]

◮ RowRP =

1 2 . . . m P Ir

  • ◮ ColRP =

Ir Q 1 2 . . . mT

◮ RA = P

Ir

  • Q
  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 41 / 51

slide-124
SLIDE 124

Gaussian elimination Algorithmic instances

The slab recursive algorithm

Slab Recursive LU [IMH82, KG85, Sto00, JPS13]

A1 A2

1 Split A Row-wise

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 42 / 51

slide-125
SLIDE 125

Gaussian elimination Algorithmic instances

The slab recursive algorithm

Slab Recursive LU [IMH82, KG85, Sto00, JPS13]

V A

1 1

U

C

21

A

22 1 Split A Row-wise 2 Recursive call on A1

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 42 / 51

slide-126
SLIDE 126

Gaussian elimination Algorithmic instances

The slab recursive algorithm

Slab Recursive LU [IMH82, KG85, Sto00, JPS13]

V A

1 1

U

C

22

G

1 Split A Row-wise 2 Recursive call on A1 3 G ← A21U −1

1

(trsm)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 42 / 51

slide-127
SLIDE 127

Gaussian elimination Algorithmic instances

The slab recursive algorithm

Slab Recursive LU [IMH82, KG85, Sto00, JPS13]

V H

1 1

U

C

G

1 Split A Row-wise 2 Recursive call on A1 3 G ← A21U −1

1

(trsm)

4 H ← A22 − G × V (MM)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 42 / 51

slide-128
SLIDE 128

Gaussian elimination Algorithmic instances

The slab recursive algorithm

Slab Recursive LU [IMH82, KG85, Sto00, JPS13]

G C C

U

1 1 2

U

2 1 Split A Row-wise 2 Recursive call on A1 3 G ← A21U −1

1

(trsm)

4 H ← A22 − G × V (MM) 5 Recursive call on H

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 42 / 51

slide-129
SLIDE 129

Gaussian elimination Algorithmic instances

The slab recursive algorithm

Slab Recursive LU [IMH82, KG85, Sto00, JPS13]

G C C

U U

1 1 2 2 1 Split A Row-wise 2 Recursive call on A1 3 G ← A21U −1

1

(trsm)

4 H ← A22 − G × V (MM) 5 Recursive call on H 6 Row permutations

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 42 / 51

slide-130
SLIDE 130

Gaussian elimination Algorithmic instances

The slab recursive algorithm

Slab Recursive LU [IMH82, KG85, Sto00, JPS13]

G C C

U U

1 1 2 2 1 Split A Row-wise 2 Recursive call on A1 3 G ← A21U −1

1

(trsm)

4 H ← A22 − G × V (MM) 5 Recursive call on H 6 Row permutations

Implements the lexicographic order search.

◮ Col/Row Transpositions : Computes the ColRP

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 42 / 51

slide-131
SLIDE 131

Gaussian elimination Algorithmic instances

The slab recursive algorithm

Slab Recursive LU [IMH82, KG85, Sto00, JPS13]

G C C

U U

1 1 2 2 1 Split A Row-wise 2 Recursive call on A1 3 G ← A21U −1

1

(trsm)

4 H ← A22 − G × V (MM) 5 Recursive call on H 6 Row permutations

Implements the lexicographic order search.

◮ Col/Row Transpositions : Computes the ColRP ◮ Row Rotations : Computes RA [DPS15]

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 42 / 51

slide-132
SLIDE 132

Gaussian elimination Algorithmic instances

The tiled recursive algorithm

Dumas, P. and Sultan 13 2 × 2 block splitting

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 43 / 51

slide-133
SLIDE 133

Gaussian elimination Algorithmic instances

The tiled recursive algorithm

Dumas, P. and Sultan 13 Recursive call

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 43 / 51

slide-134
SLIDE 134

Gaussian elimination Algorithmic instances

The tiled recursive algorithm

Dumas, P. and Sultan 13 TRSM: B ← BU −1

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 43 / 51

slide-135
SLIDE 135

Gaussian elimination Algorithmic instances

The tiled recursive algorithm

Dumas, P. and Sultan 13 TRSM: B ← L−1B

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 43 / 51

slide-136
SLIDE 136

Gaussian elimination Algorithmic instances

The tiled recursive algorithm

Dumas, P. and Sultan 13 MatMul: C ← C − A × B

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 43 / 51

slide-137
SLIDE 137

Gaussian elimination Algorithmic instances

The tiled recursive algorithm

Dumas, P. and Sultan 13 MatMul: C ← C − A × B

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 43 / 51

slide-138
SLIDE 138

Gaussian elimination Algorithmic instances

The tiled recursive algorithm

Dumas, P. and Sultan 13 MatMul: C ← C − A × B

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 43 / 51

slide-139
SLIDE 139

Gaussian elimination Algorithmic instances

The tiled recursive algorithm

Dumas, P. and Sultan 13 2 independent recursive calls (compatible with the product order)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 43 / 51

slide-140
SLIDE 140

Gaussian elimination Algorithmic instances

The tiled recursive algorithm

Dumas, P. and Sultan 13 TRSM: B ← BU −1

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 43 / 51

slide-141
SLIDE 141

Gaussian elimination Algorithmic instances

The tiled recursive algorithm

Dumas, P. and Sultan 13 TRSM: B ← L−1B

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 43 / 51

slide-142
SLIDE 142

Gaussian elimination Algorithmic instances

The tiled recursive algorithm

Dumas, P. and Sultan 13 MatMul: C ← C − A × B

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 43 / 51

slide-143
SLIDE 143

Gaussian elimination Algorithmic instances

The tiled recursive algorithm

Dumas, P. and Sultan 13 MatMul: C ← C − A × B

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 43 / 51

slide-144
SLIDE 144

Gaussian elimination Algorithmic instances

The tiled recursive algorithm

Dumas, P. and Sultan 13 MatMul: C ← C − A × B

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 43 / 51

slide-145
SLIDE 145

Gaussian elimination Algorithmic instances

The tiled recursive algorithm

Dumas, P. and Sultan 13 Recursive call

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 43 / 51

slide-146
SLIDE 146

Gaussian elimination Algorithmic instances

The tiled recursive algorithm

Dumas, P. and Sultan 13 Puzzle game (block rotations)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 43 / 51

slide-147
SLIDE 147

Gaussian elimination Algorithmic instances

The tiled recursive algorithm

Dumas, P. and Sultan 13

◮ O(mnrω−2) (2/3n3 for ω = 3) ◮ fewer modular reductions than slab algorithms ◮ rank deficiency introduces parallelism

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 43 / 51

slide-148
SLIDE 148

Gaussian elimination Algorithmic instances

Iterative algorithms

◮ Unefficient with large problems ◮ Good for base case implementations (faster in-cache computation)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 44 / 51

slide-149
SLIDE 149

Gaussian elimination Algorithmic instances

Iterative algorithms

◮ Unefficient with large problems ◮ Good for base case implementations (faster in-cache computation)

Which base case algorithm?

◮ Formerly [DPS13]: product order iterative

algorithm ✗ many permutations ✗ many modular reductions

  • L

v U w j i r r

j i

A

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 44 / 51

slide-150
SLIDE 150

Gaussian elimination Algorithmic instances

Iterative algorithms

◮ Unefficient with large problems ◮ Good for base case implementations (faster in-cache computation)

Which base case algorithm?

◮ Formerly [DPS13]: product order iterative

algorithm ✗ many permutations ✗ many modular reductions

  • L

v U w j i r r

j i

A

◮ [DPS15]: Simply use the schoolbook algorithm (Lexico+Rotations)

✓ fewer permutations ✓ modular reductions delayed more easily ✓ Crout variant: better data access pattern

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 44 / 51

slide-151
SLIDE 151

Gaussian elimination Algorithmic instances

1 2 3 4 5 6 7 8 9 10 100 200 300 400 500 600 700 Effective Gfops n PLUQ base cases mod 131071. Rank = n/2. on a i5-3320 at 2.6GHz Left-looking Product (4) Right-Looking Product (5) Pure Recursive (6)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 45 / 51

slide-152
SLIDE 152

Gaussian elimination Algorithmic instances

1 2 3 4 5 6 7 8 9 10 100 200 300 400 500 600 700 Effective Gfops n PLUQ base cases mod 131071. Rank = n/2. on a i5-3320 at 2.6GHz Rec->Left look. Prod. (2) Left-looking Product (4) Right-Looking Product (5) Pure Recursive (6)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 45 / 51

slide-153
SLIDE 153

Gaussian elimination Algorithmic instances

1 2 3 4 5 6 7 8 9 10 100 200 300 400 500 600 700 Effective Gfops n PLUQ base cases mod 131071. Rank = n/2. on a i5-3320 at 2.6GHz Crout Lexico. (3) Rec->Left look. Prod. (2) Left-looking Product (4) Right-Looking Product (5) Pure Recursive (6)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 45 / 51

slide-154
SLIDE 154

Gaussian elimination Algorithmic instances

1 2 3 4 5 6 7 8 9 10 100 200 300 400 500 600 700 Effective Gfops n PLUQ base cases mod 131071. Rank = n/2. on a i5-3320 at 2.6GHz Rec->Crout Lexico. (1) Crout Lexico. (3) Rec->Left look. Prod. (2) Left-looking Product (4) Right-Looking Product (5) Pure Recursive (6)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 45 / 51

slide-155
SLIDE 155

Gaussian elimination Algorithmic instances

2 4 6 8 10 12 14 16 18 20 500 1000 1500 2000 2500 3000 Effective Gfops n PLUQ base cases mod 131071. Rank = n/2. on a i5-3320 at 2.6GHz Rec->Crout Lexico. (1) Crout Lexico. (3) Rec->Left look. Prod. (2) Left-looking Product (4) Right-Looking Product (5) Pure Recursive (6)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 45 / 51

slide-156
SLIDE 156

Gaussian elimination Algorithmic instances

2 4 6 8 10 12 14 16 18 20 500 1000 1500 2000 2500 3000 Effective Gfops n PLUQ base cases mod 131071. Rank = n/2. on a i5-3320 at 2.6GHz Rec->Crout Lexico. (1) Crout Lexico. (3) Rec->Left look. Prod. (2) Left-looking Product (4) Right-Looking Product (5) Pure Recursive (6)

◮ > 2 Gfops improvement ◮ Implemented in FFLAS-FFPACK (kernel of LinBox).

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 45 / 51

slide-157
SLIDE 157

Gaussian elimination Relation to other decompositions

LUP and PLU decompositions

LUP If A has generic RowRP

◮ LUP(A) with Lex order and col. rot.:

Ir 0P = RA In particular, if A has full row rank and m = n: P = RA

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 46 / 51

slide-158
SLIDE 158

Gaussian elimination Relation to other decompositions

LUP and PLU decompositions

LUP If A has generic RowRP

◮ LUP(A) with Lex order and col. rot.:

Ir 0P = RA In particular, if A has full row rank and m = n: P = RA PLU If A has generic ColRP

◮ PLU(A) with RevLex order and row rot.

PIr 0 = RA In particular, if A has full column rank and m = n: P = RA

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 46 / 51

slide-159
SLIDE 159

Gaussian elimination Relation to other decompositions

Echelon forms

R A

P Q Q U L P C E = C=PLP Q UQ=E for sort

s s

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 47 / 51

slide-160
SLIDE 160

Gaussian elimination The small rank case

Small rank

When r ≪ m, n, O(mnrω−2) can be too expensive. (Compressed sensing applications) [Cheung Kwok Lau’12]: Compute the rank r and r linearly independent rows in O˜ (rω + mn) probabilistic

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 48 / 51

slide-161
SLIDE 161

Gaussian elimination The small rank case

Small rank

When r ≪ m, n, O(mnrω−2) can be too expensive. (Compressed sensing applications) [Cheung Kwok Lau’12]: Compute the rank r and r linearly independent rows in O˜ (rω + mn) probabilistic [Storjohann Yang’14:] Rank profile in O˜ (r3 + mn) probabilistic.

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 48 / 51

slide-162
SLIDE 162

Gaussian elimination The small rank case

Small rank

When r ≪ m, n, O(mnrω−2) can be too expensive. (Compressed sensing applications) [Cheung Kwok Lau’12]: Compute the rank r and r linearly independent rows in O˜ (rω + mn) probabilistic [Storjohann Yang’14:] Rank profile in O˜ (r3 + mn) probabilistic. [Storjohann Yang’15:] Rank profile in O˜ (rω + mn) probabilistic.

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 48 / 51

slide-163
SLIDE 163

Gaussian elimination The small rank case

Small rank

When r ≪ m, n, O(mnrω−2) can be too expensive. (Compressed sensing applications) [Cheung Kwok Lau’12]: Compute the rank r and r linearly independent rows in O˜ (rω + mn) probabilistic [Storjohann Yang’14:] Rank profile in O˜ (r3 + mn) probabilistic. [Storjohann Yang’15:] Rank profile in O˜ (rω + mn) probabilistic. Can the rank profile matrix be computed in similar complexities?

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 48 / 51

slide-164
SLIDE 164

Gaussian elimination The small rank case

[Storjohann Yang’14] Linear System Oracle

Sketch of the O˜ (r3 + mn) algorithm Incrementally for s = 1..rank(A), maintain

◮ an s × s invertible sub-matrix As of A. ◮ its inverse A−1 s ◮ a partial solution Asxs = bs to a linear system Ax = b.

= .

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 49 / 51

slide-165
SLIDE 165

Gaussian elimination The small rank case

[Storjohann Yang’14] Linear System Oracle

Sketch of the O˜ (r3 + mn) algorithm Incrementally for s = 1..rank(A), maintain

◮ an s × s invertible sub-matrix As of A. ◮ its inverse A−1 s ◮ a partial solution Asxs = bs to a linear system Ax = b. 1

Use A−1

s

to find the next row and column to append to As. O(sn)

= .

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 49 / 51

slide-166
SLIDE 166

Gaussian elimination The small rank case

[Storjohann Yang’14] Linear System Oracle

Sketch of the O˜ (r3 + mn) algorithm Incrementally for s = 1..rank(A), maintain

◮ an s × s invertible sub-matrix As of A. ◮ its inverse A−1 s ◮ a partial solution Asxs = bs to a linear system Ax = b. 1

Use A−1

s

to find the next row and column to append to As. O(sn)

= .

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 49 / 51

slide-167
SLIDE 167

Gaussian elimination The small rank case

[Storjohann Yang’14] Linear System Oracle

Sketch of the O˜ (r3 + mn) algorithm Incrementally for s = 1..rank(A), maintain

◮ an s × s invertible sub-matrix As of A. ◮ its inverse A−1 s ◮ a partial solution Asxs = bs to a linear system Ax = b. 1

Use A−1

s

to find the next row and column to append to As. O(sn)

= .

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 49 / 51

slide-168
SLIDE 168

Gaussian elimination The small rank case

[Storjohann Yang’14] Linear System Oracle

Sketch of the O˜ (r3 + mn) algorithm Incrementally for s = 1..rank(A), maintain

◮ an s × s invertible sub-matrix As of A. ◮ its inverse A−1 s ◮ a partial solution Asxs = bs to a linear system Ax = b. 1

Use A−1

s

to find the next row and column to append to As. O(sn)

= .

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 49 / 51

slide-169
SLIDE 169

Gaussian elimination The small rank case

[Storjohann Yang’14] Linear System Oracle

Sketch of the O˜ (r3 + mn) algorithm Incrementally for s = 1..rank(A), maintain

◮ an s × s invertible sub-matrix As of A. ◮ its inverse A−1 s ◮ a partial solution Asxs = bs to a linear system Ax = b. 1

Use A−1

s

to find the next row and column to append to As. O(sn)

2

Compute A−1

s+1 by rank 1 updates

O(s2)

= .

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 49 / 51

slide-170
SLIDE 170

Gaussian elimination The small rank case

[Storjohann Yang’14] Linear System Oracle

Sketch of the O˜ (r3 + mn) algorithm Incrementally for s = 1..rank(A), maintain

◮ an s × s invertible sub-matrix As of A. ◮ its inverse A−1 s ◮ a partial solution Asxs = bs to a linear system Ax = b. 1

Use A−1

s

to find the next row and column to append to As. O(sn)

2

Compute A−1

s+1 by rank 1 updates

O(s2)

= .

◮ Use the vector b to compress row linear

dependency information

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 49 / 51

slide-171
SLIDE 171

Gaussian elimination The small rank case

[Storjohann Yang’14] Linear System Oracle

Sketch of the O˜ (r3 + mn) algorithm Incrementally for s = 1..rank(A), maintain

◮ an s × s invertible sub-matrix As of A. ◮ its inverse A−1 s ◮ a partial solution Asxs = bs to a linear system Ax = b. 1

Use A−1

s

to find the next row and column to append to As. O(s log n)

2

Compute A−1

s+1 by rank 1 updates

O(s2)

= .

◮ Use the vector b to compress row linear

dependency information

◮ Improved by linear independence oracles

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 49 / 51

slide-172
SLIDE 172

Gaussian elimination The small rank case

[Storjohann Yang’14] Linear System Oracle

Sketch of the O˜ (r3 + mn) algorithm Incrementally for s = 1..rank(A), maintain

◮ an s × s invertible sub-matrix As of A. ◮ its inverse A−1 s ◮ a partial solution Asxs = bs to a linear system Ax = b. 1

Use A−1

s

to find the next row and column to append to As. O(s log n)

2

Compute A−1

s+1 by rank 1 updates

O(s2)

= .

◮ Use the vector b to compress row linear

dependency information

◮ Improved by linear independence oracles

  • Lexico. search with rotations computes RA
  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 49 / 51

slide-173
SLIDE 173

Gaussian elimination The small rank case

[Storjohann Yang’15] Relaxed matrix inverse

Sketch of the algorithm: RowRP in O˜ (rω + mn)

1

Insted of building A−1

s

iteratively (O(r3)), use an asymptotically fast relaxation scheme O(rω).

2

Requires to deal with only r columns in generic column RP.

3

Ensured by a call to [Cheung Kwok Lau’12] + Toeplitz preconditionner

4

Returns the row rank profile

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 50 / 51

slide-174
SLIDE 174

Gaussian elimination The small rank case

[Storjohann Yang’15] Relaxed matrix inverse

Sketch of the algorithm: RowRP in O˜ (rω + mn)

1

Insted of building A−1

s

iteratively (O(r3)), use an asymptotically fast relaxation scheme O(rω).

2

Requires to deal with only r columns in generic column RP.

3

Ensured by a call to [Cheung Kwok Lau’12] + Toeplitz preconditionner

4

Returns the row rank profile Problem: step 3 loses information required for the RA.

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 50 / 51

slide-175
SLIDE 175

Gaussian elimination The small rank case

[Storjohann Yang’15] Relaxed matrix inverse

Sketch of the algorithm: RowRP in O˜ (rω + mn)

1

Insted of building A−1

s

iteratively (O(r3)), use an asymptotically fast relaxation scheme O(rω).

2

Requires to deal with only r columns in generic column RP.

3

Ensured by a call to [Cheung Kwok Lau’12] + Toeplitz preconditionner

4

Returns the row rank profile Problem: step 3 loses information required for the RA. Solution for RA in O˜ (rω + mn)

1

Compute the RowRP I by [Storjohann Yang’15] on A

2

Compute the ColRP J by [Storjohann Yang’15] on AT

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 50 / 51

slide-176
SLIDE 176

Gaussian elimination The small rank case

[Storjohann Yang’15] Relaxed matrix inverse

Sketch of the algorithm: RowRP in O˜ (rω + mn)

1

Insted of building A−1

s

iteratively (O(r3)), use an asymptotically fast relaxation scheme O(rω).

2

Requires to deal with only r columns in generic column RP.

3

Ensured by a call to [Cheung Kwok Lau’12] + Toeplitz preconditionner

4

Returns the row rank profile Problem: step 3 loses information required for the RA. Solution for RA in O˜ (rω + mn)

1

Compute the RowRP I by [Storjohann Yang’15] on A

2

Compute the ColRP J by [Storjohann Yang’15] on AT

3

Extract the r × r submatrix Ar = AI,J

4

Compute the LUP decomp of Ar with col. rotations

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 50 / 51

slide-177
SLIDE 177

Gaussian elimination The small rank case

[Storjohann Yang’15] Relaxed matrix inverse

Sketch of the algorithm: RowRP in O˜ (rω + mn)

1

Insted of building A−1

s

iteratively (O(r3)), use an asymptotically fast relaxation scheme O(rω).

2

Requires to deal with only r columns in generic column RP.

3

Ensured by a call to [Cheung Kwok Lau’12] + Toeplitz preconditionner

4

Returns the row rank profile Problem: step 3 loses information required for the RA. Solution for RA in O˜ (rω + mn)

1

Compute the RowRP I by [Storjohann Yang’15] on A

2

Compute the ColRP J by [Storjohann Yang’15] on AT

3

Extract the r × r submatrix Ar = AI,J

4

Compute the LUP decomp of Ar with col. rotations

5

Recover RA by inflating RAr = P with zeroes.

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 50 / 51

slide-178
SLIDE 178

Gaussian elimination The small rank case

Conclusion

Design framework for high performance exact linear algebra Asymptotic reduction > algorithm tuning > building block implementation

◮ So far, floating point arithmetic delivers best speed

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 51 / 51

slide-179
SLIDE 179

Gaussian elimination The small rank case

Conclusion

Design framework for high performance exact linear algebra Asymptotic reduction > algorithm tuning > building block implementation

◮ So far, floating point arithmetic delivers best speed ◮ Medium size arithmetic: RNS

harnesses floating point efficiency embarrassingly easy parallelization (and fault tolerance)

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 51 / 51

slide-180
SLIDE 180

Gaussian elimination The small rank case

Conclusion

Design framework for high performance exact linear algebra Asymptotic reduction > algorithm tuning > building block implementation

◮ So far, floating point arithmetic delivers best speed ◮ Medium size arithmetic: RNS

harnesses floating point efficiency embarrassingly easy parallelization (and fault tolerance)

◮ Favor tiled recursive algorithms

architecture oblivious vs aware algorithms [Gustavson 07]

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 51 / 51

slide-181
SLIDE 181

Gaussian elimination The small rank case

Conclusion

Design framework for high performance exact linear algebra Asymptotic reduction > algorithm tuning > building block implementation

◮ So far, floating point arithmetic delivers best speed ◮ Medium size arithmetic: RNS

harnesses floating point efficiency embarrassingly easy parallelization (and fault tolerance)

◮ Favor tiled recursive algorithms

architecture oblivious vs aware algorithms [Gustavson 07]

◮ New pivoting strategies revealing all rank profile informations

tournament pivoting? [Demmel, Grigori and Xiang 11]

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 51 / 51

slide-182
SLIDE 182

Gaussian elimination The small rank case

Conclusion

Design framework for high performance exact linear algebra Asymptotic reduction > algorithm tuning > building block implementation

◮ So far, floating point arithmetic delivers best speed ◮ Medium size arithmetic: RNS

harnesses floating point efficiency embarrassingly easy parallelization (and fault tolerance)

◮ Favor tiled recursive algorithms

architecture oblivious vs aware algorithms [Gustavson 07]

◮ New pivoting strategies revealing all rank profile informations

tournament pivoting? [Demmel, Grigori and Xiang 11] Thank you

  • C. Pernet (LIP, U. Grenoble Alpes)

Computing the Rank Profile Matrix JNCF, Cluny, 2 nov. 2015 51 / 51