Approximate Neumann Series or Exact Matrix Inversion for Massive MIMO?
Oscar Gustafsson, Erik Bertilsson, Johannes Klasson, and Carl Ingemarsson
Matrix Inversion for Massive MIMO, Oscar Gustafsson, July 25, 2017
Matrix Inversion in Massive MIMO
- N terminals, M antennas
- Channel matrix H ∈ C^{M×N}
- Gram matrix X = H^H H ∈ C^{N×N}, to be inverted for zero forcing (or MMSE)
- X is conjugate symmetric (Hermitian) and positive semi-definite
- With uncorrelated channels and M ≫ N, X is diagonally dominant
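These properties can be checked numerically. A minimal NumPy sketch, assuming an i.i.d. Rayleigh-fading channel (variable names are illustrative, not from the talk); the M = 120, N = 20 values are the ones used in the bit-error-rate results later:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 120, 20  # antennas, terminals

# i.i.d. Rayleigh channel H in C^{M x N}, unit-variance entries
H = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)

# Gram matrix X = H^H H: Hermitian and positive semi-definite
X = H.conj().T @ H
assert np.allclose(X, X.conj().T)
assert np.all(np.linalg.eigvalsh(X) >= -1e-9)

# Diagonal entries concentrate around M while off-diagonal magnitudes
# grow only like sqrt(M), so the diagonal dominates as M/N increases
off_mask = ~np.eye(N, dtype=bool)
ratio = np.abs(np.diag(X)).mean() / np.abs(X[off_mask]).mean()
print(ratio)
```

Note that the dominance here is statistical: individual off-diagonal entries are small relative to the diagonal, which is what the Neumann approaches below rely on.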
Matrix Inversion in Massive MIMO
[Figure: frame structure over T_frame: pilot (P), N_UL,1 and N_UL,2 uplink symbols (UL), guard (G), N_DL downlink symbols (DL)]
- One matrix inversion per frame
- Computed between reception of the pilot and transmission of the first downlink data
- Latency, not throughput
Algorithms for Matrix Inversion
- Exact algorithms
- Numerical issues, especially in fixed-point, for close-to-singular (sub-)matrices
- Division and/or square roots
- Cubic complexity
- LDL⊺ decomposition
- Lowest operation count
- Reasonable fixed-point properties
- No square roots
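As a concrete reference for the exact method, a minimal LDL factorization sketch (for the complex Hermitian X this is the LDL^H variant; note it uses divisions but, as stated above, no square roots; illustrative code, not the paper's implementation):

```python
import numpy as np

def ldl_decompose(X):
    """Factor a Hermitian matrix as X = L D L^H with L unit lower
    triangular and D real diagonal; only divisions, no square roots."""
    n = X.shape[0]
    L = np.eye(n, dtype=complex)
    d = np.zeros(n)
    for j in range(n):
        # d_j = x_jj - sum_{k<j} |l_jk|^2 d_k
        d[j] = (X[j, j] - np.sum(np.abs(L[j, :j]) ** 2 * d[:j])).real
        for i in range(j + 1, n):
            # l_ij = (x_ij - sum_{k<j} l_ik conj(l_jk) d_k) / d_j
            L[i, j] = (X[i, j] - np.sum(L[i, :j] * L[j, :j].conj() * d[:j])) / d[j]
    return L, d
```

Solving X z = b then reduces to two triangular solves and a diagonal scaling, which is the equation-solving ("EQU") step that appears together with LDL⊺ in the comparison tables below.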
Algorithms for Matrix Inversion
- Neumann series expansion
- Precondition matrix A ≈ X^{-1}

  \hat{X}^{-1}_K = \sum_{n=1}^{K} (I - AX)^{n-1} A    (1)

- "High parallelism"
- "Low complexity"
- "No division"
- "Numerically stable"
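Equation (1) can be evaluated with nothing but matrix multiply-adds. A small sketch using the diagonal preconditioner introduced on the following slides (illustrative; assumes the series converges, i.e. the spectral radius of I − AX is below 1, which diagonal dominance provides):

```python
import numpy as np

def neumann_inverse(X, A, K):
    """K-term Neumann approximation of X^{-1}, eq. (1):
    sum_{n=1}^{K} (I - A X)^{n-1} A."""
    R = np.eye(X.shape[0]) - A @ X   # residual, small when A ~ X^{-1}
    Xinv = term = A
    for _ in range(K - 1):
        term = R @ term              # next series term (I - AX)^{n-1} A
        Xinv = Xinv + term
    return Xinv

# Diagonal preconditioner a_ii = 1/x_ii on a diagonally dominant Gram matrix
rng = np.random.default_rng(1)
H = (rng.standard_normal((64, 4)) + 1j * rng.standard_normal((64, 4))) / np.sqrt(2)
X = H.conj().T @ H
A = np.diag(1.0 / np.diag(X).real)
errs = [np.linalg.norm(neumann_inverse(X, A, K) - np.linalg.inv(X))
        for K in (1, 2, 3)]
```

The approximation error equals ‖(I − AX)^K X^{-1}‖, so each extra term shrinks it by roughly the norm of the residual; this is why, as discussed later, few terms suffice only when X is strongly diagonally dominant.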
Algorithms for Matrix Inversion
Diagonal precondition matrix: A = diag(a_{1,1}, a_{2,2}, ..., a_{N,N}) with a_{i,i} = 1/x_{i,i}
I − AX then has a zero diagonal and off-diagonal entries y_{i,j}
Algorithms for Matrix Inversion
Tri-diagonal precondition matrix: A has nonzero entries only on the main diagonal and the first sub- and super-diagonals (a_{i,i−1}, a_{i,i}, a_{i,i+1})
Sequential computation of A; I − AX is generic (fully populated)
Algorithms for Matrix Inversion
Diagonal + column precondition matrix: A has nonzero entries on the main diagonal (a_{i,i}) and in the first column (a_{i,1})
The first column of I − AX is then zero; the remaining entries are y_{i,j}, j ≥ 2
Computational Complexity
- The latency (time to obtain the result) of an algorithm depends on two aspects:
- Total number of operations → latency scales with the number of processing elements (PEs)
- Number of sequential operations → latency does not scale with the number of PEs
- Pipelining of the PEs
- Increases clock frequency
- Increases latency
Computational Complexity Example
4 × 4 exact matrix inversion based on LDL⊺
How Many Cycles?
- Assume multiply-and-add (MAD) operations
- Reciprocals performed using Newton–Raphson → a number of sequential MAD operations
- Sum-of-products computed using sequential MADs
- O operations, each with P pipeline stages, implemented on Q processing elements (PEs) require

  C_alg ≥ max(⌈O/Q⌉ + P − 1, P · C_latency) cycles,    (2)

  where C_latency is the number of sequential operations
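Bound (2) can be transcribed directly (the formula above is my reading of the garbled slide; the N = 4 numbers below assume, as in the later results, 3 sequential MADs per Newton–Raphson reciprocal):

```python
import math

def cycle_bound(O, Q, P, C_latency):
    """Lower bound (2): O operations with P pipeline stages on Q PEs
    need at least max(ceil(O/Q) + P - 1, P * C_latency) cycles,
    where C_latency is the number of sequential operations."""
    return max(math.ceil(O / Q) + P - 1, P * C_latency)

# Exact LDL algorithm, N = 4: MADs = N^3/2 + N^2/2 - N = 36, plus
# N = 4 reciprocals at 3 MADs each -> O = 48; sequential chain
# C_latency = (4N - 4) + 3N = 24
print(cycle_bound(48, 1, 1, 24))  # 48, matching the single-PE schedule
```

For Q ≥ 2 the actual schedules shown later (29, 26, 25 cycles) lie somewhat above this bound, which is why the results plots distinguish "actual result" from "from equation".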
Algorithm Comparison – Complexity

Method                     MADs                      Reciprocals
Exact: LDL⊺ + EQU          N³/2 + N²/2 − N           N
Neumann series:
  Diagonal, K = 2          N² − N                    N
  Diagonal, K = 3          N³/2 + N² − N/2           N
  Tri-diagonal, K = 2      3N² + 7N − 10             2N − 1
  Tri-diagonal, K = 3      N³/2 + 6N² + N/2 − 2      2N − 1
  Diag. + column, K = 2    3N²/2 + 5N/2 − 4          N
  Diag. + column, K = 3    N³/2 + 5N²/2 − 2N − 1     N
Algorithm Comparison – Latency

Method                     MADs       Reciprocals
Exact: LDL⊺ + EQU          4N − 4     N
Neumann series:
  Diagonal, K = 2          2          1
  Diagonal, K = 3          N + 1      1
  Tri-diagonal, K = 2      2N + 5     N
  Tri-diagonal, K = 3      3N + 5     N
  Diag. + column, K = 2    N + 2      1
  Diag. + column, K = 3    2N + 1     1
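The tabulated operation counts can be compared directly. A small sketch pitting the diagonal Neumann variants against the exact method (formulas transcribed from the complexity table above, evaluated at N = 8):

```python
def mads_exact(N):
    """MAD count of the exact method (LDL + EQU) from the table."""
    return N**3 / 2 + N**2 / 2 - N

def mads_diag(N, K):
    """MAD count of the diagonal Neumann preconditioner, K = 2 or 3."""
    if K == 2:
        return N**2 - N
    return N**3 / 2 + N**2 - N / 2  # K = 3

N = 8
print(mads_exact(N), mads_diag(N, 2), mads_diag(N, 3))  # 280.0 56 316.0
```

K = 3 already costs more MADs than the exact method (316 vs. 280 at N = 8), which is one of the conclusions; K = 2 is far cheaper but only accurate when X is strongly diagonally dominant.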
Results
Bit-error rate for the four approaches, N = 20, M = 120
[Figure: BER (10⁻⁸ to 10⁰) vs. SNR; curves: Diagonal + Column, Diagonal, Tri-diagonal, exact LDL⊺]
Results
Reciprocal ⇒ three sequential MAD operations, 4 × 4 matrix
[Figure: operations scheduled per cycle; #PE = 1: latency 48 cycles, #PE = 2: 29, #PE = 3: 26, #PE = 4: 25]
Results – 16 × 16
Solid: actual result, dashed: from equation (2)
[Figure: cycles (10² to 10⁴) vs. number of processing elements; curves: Tri-diagonal, Col. + Diag., Diagonal, Exact]
Results – 8 × 8
Solid: actual result, dashed: from equation (2)
[Figure: cycles (10¹ to 10³) vs. number of processing elements; curves: Col. + Diag., Diagonal, Exact]
Results
With P = 1, 2, 3, 4 levels of pipelining, 4 × 4 matrix
[Figure: operations scheduled per cycle; P = 1: latency 48 cycles, P = 2: 57, P = 3: 77, P = 4: 98]
Results – 16 × 16
Time in single-cycle latency operations, assuming pipelining increases speed linearly
Solid: P = 1, dashed: P = 2, dash-dotted: P = 3
[Figure: time (10² to 10³) vs. number of processing elements (1–4); curves: Col. + Diag., Diagonal, Exact]
Results – 8 × 8
Time in single-cycle latency operations, assuming pipelining increases speed linearly
Solid: P = 1, dashed: P = 2, dash-dotted: P = 3
[Figure: time (10¹ to 10²) vs. number of processing elements (1–4); curves: Col. + Diag., Diagonal, Exact]
Design Example
- Assume a latency requirement of 0.05 ms (10% of an LTE-like frame with 2 UL and 2 DL symbols)
- For N = 8 and one PE, 304 cycles are required for the exact algorithm
- One PE operating at fclk = 6.08 MHz
- N = 30 ⇒ fclk ≈ 280 MHz
- 2 kInv/s, idle 90% of the time
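The clock-frequency numbers follow directly from the cycle count. A sketch reproducing them, assuming (as in the cycle-count discussion) N³/2 + N²/2 − N MADs plus 3 Newton–Raphson MADs per reciprocal, all sequential on one PE:

```python
def exact_cycles(N):
    """Single-PE cycle count for the exact algorithm: all MADs in
    sequence, each of the N reciprocals costing 3 Newton-Raphson MADs."""
    return int(N**3 / 2 + N**2 / 2 - N + 3 * N)

def required_fclk(N, latency=0.05e-3):
    """Clock frequency needed to finish one inversion in the budget."""
    return exact_cycles(N) / latency

print(exact_cycles(8))                 # 304 cycles
print(required_fclk(8) / 1e6)          # ~6.08 MHz
print(round(required_fclk(30) / 1e6))  # 280 (MHz)
```

With one inversion per 0.5 ms frame but a 0.05 ms deadline, the PE delivers 2 kInv/s while sitting idle 90% of the time, which is the point of the final bullet.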
Is Neumann useful at all?
- If fewer than three terms are used, the complexity may be lower
- Only compute parts of the third iteration
- Allows increasing the number of terminals further
- But numerically most efficient when the ratio between the number of antennas and terminals is high
- May give a better result with singular or close-to-singular matrices (an incorrect result may not be as bad as that of an exact algorithm)
- (Really) large matrices
Conclusions
- Latency, not throughput
- Complexity for the Neumann series with K = 3 is higher than for the best exact algorithm
- Few terms suffice for the Neumann series when the matrix is diagonally dominant
- Diagonally dominant ⇒ well conditioned ⇒ exact algorithm behaves well
- Few terminals ⇒ more diagonally dominant ⇒ fewer Neumann terms (but also less complexity for the exact algorithm)
- With few PEs compared to the matrix size, the limited parallelism of the exact algorithm is no problem
- Required latency/parallelism determined by the frame structure