Approximate Neumann Series or Exact Matrix Inversion for Massive MIMO?

Oscar Gustafsson, Erik Bertilsson, Johannes Klasson, and Carl Ingemarsson

Matrix Inversion for Massive MIMO, Oscar Gustafsson, July 25, 2017

Matrix Inversion in Massive MIMO

  • N terminals, M antennas
  • Channel matrix, H ∈ C^(M×N)
  • Gram matrix, X = HᴴH ∈ C^(N×N), to be inverted for zero forcing (or MMSE)
  • X: conjugate symmetric (Hermitian) and positive semi-definite
  • X: with uncorrelated channels and M ≫ N, diagonally dominant
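As an illustrative sketch (not from the slides), the quantities above can be formed in a few lines of NumPy. The dimensions and the i.i.d. Rayleigh channel are assumptions, chosen to match the bit-error-rate results later in the deck (N = 20, M = 120):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 120, 20                        # antennas, terminals (illustrative)

# Channel matrix H in C^(M x N), i.i.d. Rayleigh entries (assumption)
H = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)

# Gram matrix X = H^H H in C^(N x N): Hermitian and positive semi-definite
X = H.conj().T @ H

# Zero forcing: s_hat = X^{-1} H^H y; here y = H s with no noise, so s is recovered
s = rng.standard_normal(N) + 1j * rng.standard_normal(N)
y = H @ s
s_hat = np.linalg.solve(X, H.conj().T @ y)
assert np.allclose(s_hat, s)
```

With M ≫ N the diagonal entries of X (channel-vector energies) dominate the off-diagonal inner products between different terminals' channels, which is the diagonal-dominance property the Neumann-series approaches rely on.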

Matrix Inversion in Massive MIMO

Frame structure (figure): pilot P, uplink blocks UL, guard intervals G, and downlink blocks DL within one frame of length T_frame, with N_UL,1, N_UL,2, and N_DL symbols

  • One matrix inversion per frame
  • Computed between reception of the pilot and transmission of the first downlink data
  • Latency, not throughput
Algorithms for Matrix Inversion

  • Exact algorithms
  • Numerical issues, especially in fixed-point, for close-to-singular (sub-)matrices
  • Division and/or square roots
  • Cubic complexity
  • LDL⊺-decomposition
  • Lowest operation count
  • Reasonable fixed-point properties
  • No square roots
Algorithms for Matrix Inversion

  • Neumann series expansion
  • Precondition matrix A ≈ X^(−1):

    X̂_K^(−1) = Σ_{n=1}^{K} (I − AX)^(n−1) A    (1)

  • “High parallelism”
  • “Low complexity”
  • “No division”
  • “Numerically stable”
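A minimal NumPy sketch of the series (1), here with a diagonal precondition matrix; the dimensions and the random channel are illustrative assumptions:

```python
import numpy as np

def neumann_inverse(X, A, K):
    """K-term Neumann approximation of X^{-1}, per Eq. (1):
    sum_{n=1}^{K} (I - A X)^{n-1} A."""
    E = np.eye(X.shape[0], dtype=X.dtype) - A @ X
    term = A.copy()            # n = 1 term
    total = A.copy()
    for _ in range(2, K + 1):
        term = E @ term        # builds (I - AX)^{n-1} A
        total = total + term
    return total

rng = np.random.default_rng(1)
M, N = 120, 20                 # illustrative sizes; M >> N aids convergence
H = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
X = H.conj().T @ H

A = np.diag(1.0 / np.real(np.diag(X)))   # diagonal preconditioner, a_ii = 1/x_ii
I = np.eye(N)
err2 = np.linalg.norm(neumann_inverse(X, A, 2) @ X - I)
err3 = np.linalg.norm(neumann_inverse(X, A, 3) @ X - I)
assert err3 < err2             # adding a term improves the approximation
```

Since Σ_{n=1}^{K} E^{n−1}(I − E) = I − E^K with E = I − AX, the residual after K terms is exactly −E^K, so the series converges whenever the spectral radius of I − AX is below one — which diagonal dominance (M ≫ N) helps ensure.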
Algorithms for Matrix Inversion

Diagonal precondition matrix A = diag(a_{1,1}, a_{2,2}, …, a_{N,N}) with a_{i,i} = 1/x_{i,i}

With this choice, I − AX has a zero diagonal:

I − AX =
  [    0      y_{1,2}  ···  y_{1,N} ]
  [ y_{2,1}     0      ···  y_{2,N} ]
  [    ⋮        ⋮       ⋱      ⋮    ]
  [ y_{N,1}  y_{N,2}   ···     0    ]

Algorithms for Matrix Inversion

Tri-diagonal precondition matrix: A is tri-diagonal, with entries a_{i,i−1}, a_{i,i}, a_{i,i+1}. Computing A is sequential, and I − AX is a generic (full) matrix.

Algorithms for Matrix Inversion

Diagonal + column precondition matrix: A contains the diagonal entries a_{i,i} together with the first column a_{i,1}. The first column (and first diagonal entry) of I − AX is then zeroed:

I − AX =
  [ 0  y_{1,2}  ···  y_{1,N} ]
  [ 0  y_{2,2}  ···  y_{2,N} ]
  [ ⋮     ⋮      ⋱      ⋮    ]
  [ 0  y_{N,2}  ···  y_{N,N} ]

Computational Complexity

  • The latency (time to obtain the result) of an algorithm depends on two aspects:
  • Total number of operations → latency scales with number of processing elements (PEs)
  • Number of sequential operations → latency does not scale with number of PEs
  • Pipelining of the PEs
  • Increases clock frequency
  • Increases latency
Computational Complexity Example

4 × 4 exact matrix inversion based on LDL⊺

How Many Cycles?

  • Assume multiply-and-add (MAD) operations
  • Reciprocals performed using Newton-Raphson → a number of sequential MAD operations
  • Sum-of-products computed using sequential MADs
  • O operations, each with P pipeline stages, implemented on Q processing elements (PEs) require

    C_alg ≥ max(⌈O/Q⌉ + P − 1, P · C_latency) cycles.    (2)
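Both ingredients can be sketched in Python. The Newton-Raphson reciprocal uses only multiply/add operations; the linear initial guess (valid for inputs normalized into [0.5, 1]) and the iteration count are assumptions for illustration, not taken from the slides. The `cycles()` helper evaluates the lower bound in Eq. (2):

```python
import math

def nr_reciprocal(x, iters=3):
    """Newton-Raphson reciprocal 1/x using only multiply/add operations:
    y <- y * (2 - x * y). Division-free; the error squares each iteration."""
    # Linear initial guess, adequate for normalized x in [0.5, 1] (assumption).
    y = 2.9142 - 2.0 * x
    for _ in range(iters):
        y = y * (2.0 - x * y)
    return y

def cycles(O, P, Q, C_latency):
    """Lower bound of Eq. (2): O operations, each with P pipeline stages,
    on Q PEs, with C_latency sequential operations on the critical path."""
    return max(math.ceil(O / Q) + P - 1, P * C_latency)

# Three iterations already give ~1e-9 accuracy on this range:
assert abs(nr_reciprocal(0.75) - 1 / 0.75) < 1e-9
```

The max() in the bound captures the two regimes from the previous slide: with few PEs the total operation count O dominates; with many PEs the sequential critical path C_latency (stretched by pipelining P) dominates.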
Algorithm Comparison – Complexity

Method                            MADs                      Reciprocals
Exact (LDL⊺ + EQU)                N³/2 + N²/2 − N           N
Neumann, diagonal, K = 2          N² − N                    N
Neumann, diagonal, K = 3          N³/2 + N² − N/2           N
Neumann, tri-diagonal, K = 2      3N² + 7N − 10             2N − 1
Neumann, tri-diagonal, K = 3      N³/2 + 6N² + N/2 − 2      2N − 1
Neumann, diag. + column, K = 2    3N²/2 + 5N/2 − 4          N
Neumann, diag. + column, K = 3    N³/2 + 5N²/2 − 2N − 1     N
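To make the cubic-complexity comparison concrete, the exact-LDL⊺ and Neumann (diagonal, K = 3) rows can be evaluated for a few sizes. The closed-form expressions below are transcribed from the comparison table, so treat this as a plain arithmetic check of them, not new analysis:

```python
def mads_exact_ldl(N):
    """MAD count for the exact LDL^T-based inversion: N^3/2 + N^2/2 - N."""
    return (N**3 + N**2 - 2 * N) // 2

def mads_neumann_diag_k3(N):
    """MAD count for the Neumann series, diagonal A, K = 3: N^3/2 + N^2 - N/2."""
    return (N**3 + 2 * N**2 - N) // 2

for N in (4, 8, 16, 32):
    # The K = 3 Neumann series needs exactly (N^2 + N)/2 more MADs than exact LDL^T.
    assert mads_neumann_diag_k3(N) - mads_exact_ldl(N) == (N**2 + N) // 2
```

Subtracting the two expressions shows the K = 3 Neumann series is always more expensive than the exact algorithm, by (N² + N)/2 MADs — the point made again in the conclusions.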

Algorithm Comparison – Latency

Method                            Sequential MADs           Reciprocals
Exact (LDL⊺ + EQU)                4N − 4                    N
Neumann, diagonal, K = 2          2                         1
Neumann, diagonal, K = 3          N + 1                     1
Neumann, tri-diagonal, K = 2      2N + 5                    N
Neumann, tri-diagonal, K = 3      3N + 5                    N
Neumann, diag. + column, K = 2    N + 2                     1
Neumann, diag. + column, K = 3    2N + 1                    1
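The latency gap is easiest to see by collapsing each row into a single critical-path length. Converting one reciprocal into three sequential MADs follows the Newton-Raphson assumption used in the results section:

```python
def seq_cost(mads, reciprocals, mads_per_reciprocal=3):
    """Critical-path length in MAD-equivalents, assuming each reciprocal
    costs three sequential MADs (Newton-Raphson, as in the results)."""
    return mads + mads_per_reciprocal * reciprocals

N = 16
exact = seq_cost(4 * N - 4, N)        # LDL^T + EQU: 4N - 4 MADs, N reciprocals
neumann_diag_k2 = seq_cost(2, 1)      # diagonal A, K = 2: 2 MADs, 1 reciprocal

assert neumann_diag_k2 == 5
assert exact == 108                   # 60 MADs + 3 * 16 reciprocal-MADs
assert neumann_diag_k2 < exact        # Neumann has a far shorter critical path
```

This is where the Neumann approaches genuinely win: their critical path is constant or linear in N with a single reciprocal, while the exact algorithm's grows as roughly 7N MAD-equivalents.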

Results

Bit-error rate for the four approaches, N = 20, M = 120 (figure: BER curves for Diagonal + Column, Diagonal, Tri-diagonal, and LDL⊺)

Results

Reciprocal ⇒ three sequential MAD operations; 4 × 4 matrix. Operation schedules (figure, operations per cycle):

  • #PE: 1, latency: 48 cycles
  • #PE: 2, latency: 29 cycles
  • #PE: 3, latency: 26 cycles
  • #PE: 4, latency: 25 cycles

Results – 16 × 16

Cycles versus number of processing elements; solid: actual result, dashed: from equation (curves: Tri-diagonal, Col. + Diag., Diagonal, Exact)

Results – 8 × 8

Cycles versus number of processing elements; solid: actual result, dashed: from equation (curves: Col. + Diag., Diagonal, Exact)

Results

With P = 1, 2, 3, 4 levels of pipelining, 4 × 4 matrix (figure, operations per cycle):

  • P: 1, latency: 48 cycles
  • P: 2, latency: 57 cycles
  • P: 3, latency: 77 cycles
  • P: 4, latency: 98 cycles

Results – 16 × 16

Time in single-cycle-latency operations versus number of processing elements, assuming pipelining increases speed linearly; solid: P = 1, dashed: P = 2, dash-dotted: P = 3 (curves: Col. + Diag., Diagonal, Exact)

Results – 8 × 8

Time in single-cycle-latency operations versus number of processing elements, assuming pipelining increases speed linearly; solid: P = 1, dashed: P = 2, dash-dotted: P = 3 (curves: Col. + Diag., Diagonal, Exact)

Design Example

  • Assume a latency requirement of 0.05 ms (10% of an LTE-like frame with 2 UL and 2 DL symbols)
  • For N = 8 and one PE, 304 cycles are required for the exact algorithm
  • One PE operating at f_clk = 6.08 MHz
  • N = 30 ⇒ f_clk ≈ 280 MHz
  • 2 kInv/s, idle 90% of the time
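The numbers on this slide follow from simple arithmetic, sketched below; the 0.5 ms frame length is implied by the stated 10% figure:

```python
latency_s = 0.05e-3                   # latency requirement: 0.05 ms
cycles_exact = 304                    # N = 8, one PE, exact algorithm

# Minimum clock frequency so that 304 cycles fit within 0.05 ms:
f_clk_hz = cycles_exact / latency_s
assert round(f_clk_hz / 1e6, 2) == 6.08   # 6.08 MHz, as on the slide

# One inversion per frame; 0.05 ms is 10% of the 0.5 ms frame:
frame_s = 0.5e-3
inversions_per_s = 1 / frame_s
assert inversions_per_s == 2000           # 2 kInv/s
assert latency_s / frame_s == 0.1         # busy 10% => idle 90% of the time
```

The point is that a very modest clock rate already meets the latency target for N = 8, and even then the PE sits idle most of the frame.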
Is Neumann useful at all?

  • If fewer than three terms are used, the complexity may be lower
  • Only compute parts of the third iteration
  • Allows increasing the number of terminals further
  • But numerically most efficient when the ratio between the number of antennas and terminals is high
  • May give a better result with singular or close-to-singular matrices (the result is not correct, but may not be as bad as that of an exact algorithm)
  • (Really) large matrices
Conclusions

  • Latency, not throughput
  • Complexity for the Neumann series with K = 3 is higher than for the best exact algorithm
  • Few terms suffice for Neumann when the matrix is diagonally dominant
  • Diagonally dominant ⇒ well conditioned ⇒ exact algorithm behaves well
  • Few terminals ⇒ more diagonally dominant ⇒ fewer Neumann terms (but also less complexity for the exact algorithm)
  • With few PEs compared to the matrix size, the limited parallelism of the exact algorithm is no problem
  • Required latency/parallelism determined by frame structure

Thank you! Questions?

www.liu.se