 
              A Parallel Method for the Computation of Matrix Exponential based on Truncated Neumann Series V. S. Dimitrov 12 , V. Ariyarathna 3 , D. F. G. Coelho 1 , L. Rakai 1 , A. Madanayake 3 , R. J. Cintra 4 1 ECE Department, University of Calgary, Canada 2 Computer Modelling Group, Canada 3 ECE Department, University of Akron, USA 4 Statistics Department, Universidade Federal de Pernambuco, Brazil July 20, 2017 Coelho and Dimitrov (UofC) July 20, 2017 1 / 21
Introduction Problems in many areas require the solution of sets of linear, constant coefficient differential equations in the form: x ( t ) = Ax ( t ) = ˙ ⇒ x ( t ) = exp ( t A ) x 0 When multiple inputs are used for the same system, it might be advantageous compute the matrix exponential. Coelho and Dimitrov (UofC) July 20, 2017 2 / 21
Methods for Matrix Exponential Series expansion: ◮ Taylor; ◮ Padé; ◮ Scaling & Squaring. Newton Interpolation; Cayley-Hamilton method; Eigenvectors decomposition. Coelho and Dimitrov (UofC) July 20, 2017 3 / 21
Matrix Exponential Series Expansion The problem can be treated as the evaluation of a polynomial. Existing methods: ◮ Horner rule; ◮ Estrin method; ◮ Binary tree. Coelho and Dimitrov (UofC) July 20, 2017 4 / 21
Conventions Let A be a square matrix of size n × n . Let p N ( · ) be a polynomial of degree N − 1 over the real numbers. Let also g N ( · ) be a geometric series with N terms. Coelho and Dimitrov (UofC) July 20, 2017 5 / 21
Definition The critical path associated with the computation of a matrix polynomial p N ( A ) is the largest chain of matrix multiplications (MM) in order to evaluate p N ( A ) . Definition (Critical Path for Matrix Polynomial) Horner rule: N − 1 MM; Estrin method: 2 log 2 ( N − 1 ) MM; Binary tree: 2 log 2 ( N − 1 ) MM. Coelho and Dimitrov (UofC) July 20, 2017 6 / 21
Geometric Series Geometric series of matrix arguments can be computed efficiently with the use of different polynomial factorizations. � ( I + A 2 ) · g N / 2 ( A 2 ) , if N ≡ 0 mod 2 g N ( A ) = I + ( A + A 2 ) · g ( N − 1 ) / 2 ( A 2 ) , if N ≡ 1 mod 2 .  ( I + A + A 2 ) · g N / 3 ( A 3 ) , if N ≡ 0 mod 3 ,  I + ( A + A 2 + A 3 ) · g ( N − 1 ) / 3 ( A 3 ) , g N ( A ) = if N ≡ 1 mod 3 , I + A + ( A 2 + A 3 + A 4 ) · g ( N − 2 ) / 3 ( A 3 ) ,  if N ≡ 2 mod 3 . In general, the use of basis P demands P ′ log 2 ( N ) − 2 < 2 log 2 ( N ) − 2. Coelho and Dimitrov (UofC) July 20, 2017 7 / 21
Geometric Series In general, the use of basis P demands P ′ log 2 ( N ) − 2 < 2 log 2 ( N ) − 2. Examples: Basis 2: 2 log 2 ( N ) − 2; Basis 3: 1 . 89 . . . log 2 ( N ) − 2; Basis 5: 1 . 72 . . . log 2 ( N ) − 2; Basis 6: 1 . 92 . . . log 2 ( N ) − 2; Basis 26: 1 . 70 . . . log 2 ( N ) − 2. Coelho and Dimitrov (UofC) July 20, 2017 8 / 21
The Matrix Exponential as Several Neumann Series We write the matrix exponential truncated series expansion p N ( A ) as a linear combination of different geometric series on α k A , k = 0 , 1 , . . . , N − 1: N − 1 � p N ( A ) = g n + 1 ( α n A ) n = 0 N − 1 � n − 1 � � � α k n A k = n = 0 k = 0 N − 1 � N − 1 � � � α n · A n . = k n = 0 k = n Coelho and Dimitrov (UofC) July 20, 2017 9 / 21
The Matrix Exponential as Several Neumann Series If the coefficients of p N ( · ) are p 0 , p 1 , . . . , p N − 1 , we have he system α 0 + α 1 + α 2 + . . . + α N − 1 = p 0 α 2 1 + α 2 2 + . . . + α 2 N − 1 = p 1 α 3 2 + . . . + a 3 N − 1 = p 2 . . . α N N − 1 = p N − 1 . This system has several complex solutions that can be found by back substitution. Coelho and Dimitrov (UofC) July 20, 2017 10 / 21
A Numerical Example Small degree polynomials does not require complex solutions. Considering N = 4, we have α 0 = 0 . 451801 α 1 = 0 . 420627 α 2 = 0 . 344837 α 3 = − 0 . 217308 Coelho and Dimitrov (UofC) July 20, 2017 11 / 21
Another Numerical Example Table: Calculated coefficients for N = 9. Coefficient Value 0 . 265650069930254 α 8 0 . 270164634258582 α 7 α 6 0 . 294213807881822 α 5 0 . 320211897615876 0 . 339931939777992 α 4 0 . 312847569943447 α 3 − 0 . 803019919407973 β 1 β 0 − 0 . 046083639081852 Coelho and Dimitrov (UofC) July 20, 2017 12 / 21
A Different Approach If we modify the formulation to N − 1 N − 1 � N − 1 � � � � α n · A n . p N ( A ) = g N ( α n A ) = k n = 0 n = 0 k = 0 we obtain α 0 + α 1 + α 2 + . . . + α N − 1 = 1 N − 1 = 1 α 2 0 + α 2 1 + α 2 2 + . . . + α 2 2 . . . 1 α N − 1 + α N − 1 + α N − 1 + . . . + α N − 1 N − 1 = 0 1 2 ( N − 1 )! Coelho and Dimitrov (UofC) July 20, 2017 13 / 21
Algorithmic Example for N = 9 Pre Computation: B = A 2 , C = B 2 , broadcast A , B , C , β 0 , and β 1 Processor 0 computes H 9 ( A ) ← β 0 A + β 1 B − ( N − 4 ) I Processor 1 computes g 4 ( α 3 A ) ← ( I + α 3 A )( I + α 2 3 B ) Processor 2 computes g 5 ( α 4 A ) ← I + ( α 4 A + α 2 4 B )( I + α 2 4 B ) Processor 3 computes g 6 ( α 5 A ) ← ( I + α 5 A )( I + α 2 5 B + α 4 5 C ) Processor 4 computes g 7 ( α 6 A ) ← I + ( α 6 A + α 2 6 B )( I + α 2 6 B + α 4 6 C ) Processor 5 computes g 8 ( α 7 A ) ← ( I + α 7 A )( I + α 2 7 B )( I + α 4 7 C ) Processor 6 computes g 9 ( α 8 A ) ← I + ( α 8 A + α 2 8 B )( I + α 2 8 B )( I + α 4 8 C ) Return E 9 ( A ) = � 9 n = 3 g n + 1 ( α n A ) + H 9 ( A ) Figure: Fragment of the algorithm for computing E 9 ( A ) . Coelho and Dimitrov (UofC) July 20, 2017 14 / 21
Computing Time Trade-Off in Software Error × 10 -4 m =10 Time 10 0 2 expm time 10 -5 1 10 -10 0 × 10 -3 m =100 10 0 2 Time (s) Error 10 -5 1 10 -10 0 m =1000 10 0 0.4 10 -5 0.2 10 -10 0 1 2 3 4 5 6 7 8 9 N Figure: Illustration of the accuracy versus computing time trade-off for different values of N and m . Coelho and Dimitrov (UofC) July 20, 2017 15 / 21
Hardware Realization 4 H 9 ( A ) 4 4 S/P G 4 ( α 3 A ) a 12 a 11 4 Re−arrange 4 G 5 ( α 4 A ) A · A E 9 ( A ) a 22 a 21 Addition block 4 G 6 ( α 5 A ) 2 t = 1 t = 0 P/S 4 G 7 ( α 6 A ) 4 4 G 8 ( α 7 A ) A 2 · A 2 4 G 9 ( α 8 A ) Figure: Top level view of the implementation of the proposed algorithm. Coelho and Dimitrov (UofC) July 20, 2017 16 / 21
Hardware Realization v 22 v 12 v 21 v 11 D D u 12 u 11 w 11 w 12 u 21 u 22 D D t = 1 t = 0 w 21 w 22 Figure: Multiplication block for 2 × 2 matrices. Coelho and Dimitrov (UofC) July 20, 2017 17 / 21
Hardware Realization Results: FPGA Table: Timing and resource consumption comparison for Xilinx xc6vlx240t-ff1156 FPGA Figure of merit Horner’s Rule New Algorithm Latency 16 6 (clock cycles) Critical path 14.392 11.160 delay (ns) Slice LUTs used 24537 90993 No. of adders 48 118 No. of multipliers 32 112 Coelho and Dimitrov (UofC) July 20, 2017 18 / 21
Hardware Realization Results: ASIC Table: ASIC synthesis results Horner’s New Percentage Figure of merit Rule Algorithm Change T (ns) 3.556 1.350 ↓ 62.04 % Occupied area 1.355 4.080 ↑ 201.10% (A, mm 2 ) Dynamic power 3199.088 2403.661 ↓ 24.86% ( mW / GHz ) AT ( mm 2 · ns ) 4.8175 5.5083 ↑ 14.34% AT 2 ( mm 2 · ns 2 ) 17.1313 7.4363 ↓ 56.59% Max frequency 0.2812 0.7407 ↑ 163.40% (GHz) Latency 16 6 ↓ 62.5 % (Clock cycles) Total gate count 61009 131874 ↑ 116.15% Coelho and Dimitrov (UofC) July 20, 2017 19 / 21
Final Comments Advantages: ◮ the proposed method reduce critical path; Disadvantages: ◮ requires more processors and memory (software); ◮ requires more hardware resources such as LUT and gates (hardware). Future works: ◮ consider different combinations of Neumann series for different solutions (real solution possible?); ◮ consider more matrix functions and general polynomials; ◮ provide accurate error analysis. Coelho and Dimitrov (UofC) July 20, 2017 20 / 21
Questions? Coelho and Dimitrov (UofC) July 20, 2017 21 / 21
Recommend
More recommend