Energy Efficient Adaptive Beamforming on Sensor Networks


  1. Energy Efficient Adaptive Beamforming on Sensor Networks. Viktor K. Prasanna, Bhargava Gundala, Mitali Singh, Dept. of EE-Systems, University of Southern California. Email: prasanna@usc.edu, http://ceng.usc.edu/~prasanna, http://pacman.usc.edu

  2. Outline • Problem Definition • Computational Characteristics • Prior Solution • Power Optimizations: Sensor Node Level, Inter-Node Level • Challenges/Discussion

  3. Problem Scenario [figure: an energy constrained network of passive and active sensor nodes]

  4. Beamforming. Definition: the technique that spatially filters the signals received from an array of sensors and estimates the spatial features of the sources. Procedure: 1. passively and repeatedly sample acoustic propagation wave field signals; 2. linearly combine the input data with a weight matrix to form a sonar beam for a particular direction of look. Adaptive sonar beamforming (for high SNR and high resolution): time-changing signal and noise properties are included in the derivation of the weights, making them adapt accordingly.

  5. Space Time Adaptive Processing [figure: input data cube of N elements x M pulse repetition intervals (PRIs) x L range gates; each Coherent Processing Interval (CPI) is processed for target detection]

  6. MITRE RT_STAP Benchmark [figure: processing chain from the input data (L = 1920 range gates, N = 22 elements, M = 64 PRIs) through Preprocessing Step 1, Preprocessing Step 2, Doppler Processing, Weight Computation, and Weight Application]. T_latency = 161.25 msec and T_period = 32.25 msec.

  7. Input Data Cube: elements (N = 22) x PRIs (M = 64) x range gates (L = 1920).

  8. Sonar Signal Processing [figure: adaptive beamforming chain; sampling rate = 10 Hz ~ 25 kHz, output rate = 1 Hz ~ 100 Hz; frequency domain paths: element space adaptive beamforming (FFT) and beam space adaptive beamforming (FFT plus conventional beamforming); time domain path; 100 ~ 5000 beams per output]

  9. An Example Adaptive Beamformer: MVDR (Minimum Variance Distortionless Response) [figure: N channels -> FFT -> corner turn -> for each of F frequency bins: covariance factorization -> linear solver and beamformer with steering -> B beams per bin]

  10. Computational Characteristics [figure: initial data layout and outputs across subproblems S1, S2, S3, S4] • Overall processing consists of a sequence of subproblems • Computational requirements are different for each subproblem • A large amount of data is repeatedly processed in real time • Data access patterns change from subproblem to subproblem • Throughput and latency performance requirements

  11. Adaptive Processing: Key Problems • Doppler Processing (FFT) • Weight Computation (covariance matrix factorization) • Weight Application (matrix-vector product) [figure: data cube of elements (N = 22) x PRIs (M = 64) x range gates (L = 1920)]

  12. Prior Solution. Architecture: a tightly coupled collection of processors connected by a high bandwidth, low latency network, producing the target detection output.

  13. Key Issue: Communication Cost. Coarse grain machines have powerful processing nodes. T3E (typical configuration): 1200 Mflops/node (T3E-1200), local memory access time 87 ~ 253 nsec, global memory access time 1 ~ 2 µsec (SHMEM). SP-2 (typical configuration): 640 Mflops/node, 64 MB – 4 GB memory, 4.5 – 36.2 GB internal disk. Large software overhead for message transfer: on the SP-2, ~39 µsec overhead per message using MPL/MPI and ~9 nsec/byte/node transfer rate, versus local memory access times of 100s of nsec.

  14. Key Idea: Data Remapping [figure: data access patterns across processors P0 ... P3 for subproblems S1, S2, S3, with optional remapping between subproblems]. The benefits of remapping must exceed the overhead.

  15. Impact of Data Remapping [figure: our results compared with results reported in IPPS '95]. Implementation performed on the IBM SP-2 at MHPCC; code developed using C, MPI, and ESSL.

  16. Lessons Learnt. Objective: adaptive beamforming on parallel machines • Task level parallelism • Minimize communication cost • Data remapping

  17. Energy Efficiency • The network of sensors is energy constrained: power is critical and must be conserved • Reduce power dissipation at the sensor node level: energy efficient algorithms • Decrease power dissipation at the inter-node level: optimize the communication cost between sensors

  18. Power Model for a Processing Element [figure: processor (frequency control f_p) with FU and cache, data bus (frequency control f_b), and memory]. Power_Total = Power_Processor + Power_DataBus + Power_Memory. Power_unit = Power_Dynamic + Power_Static = 0.5 f(n) C V² f_Active + V I_Leakage. f_max ∝ (V - V_t)/V.
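As a concrete illustration of this power model, a minimal sketch in C; the function names and all numeric parameters are placeholders of my own, not figures from the talk:

```c
#include <stdio.h>

/* Per-unit power model from the slide:
 *   P_unit = P_dynamic + P_static
 *          = 0.5 * f(n) * C * V^2 * f_active + V * I_leakage
 * where f(n) is the switching activity factor. */
static double unit_power(double activity, double C, double V,
                         double f_active, double I_leak)
{
    double p_dynamic = 0.5 * activity * C * V * V * f_active;
    double p_static  = V * I_leak;
    return p_dynamic + p_static;
}

int main(void)
{
    /* Placeholder parameters, for illustration only. */
    double p_proc = unit_power(0.3, 1e-9, 1.5, 200e6, 1e-3);
    double p_bus  = unit_power(0.2, 5e-9, 1.5, 100e6, 0.5e-3);
    double p_mem  = unit_power(0.1, 2e-9, 1.5, 100e6, 2e-3);

    /* P_total = P_processor + P_databus + P_memory */
    printf("P_total = %.3f W\n", p_proc + p_bus + p_mem);
    return 0;
}
```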

  19. Reduce Processor-Memory Data Traffic. Instructions that access memory consume considerably more power. Instruction energy (10^-8 Joules, Intel 486DX2): MOV DX, BX = 2.49; MOV DX, [BX] = 3.53; MOV [BX], DX = 4.30. Reduce the number of memory accesses: reduce cache misses through high data reuse in the cache, use registers, and reduce the power consumed on the data bus (see the register-reuse sketch below).
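To make the register-versus-memory point concrete, a hedged C sketch (not from the talk): accumulating a dot product in a local variable lets the compiler keep the running sum in a register, so the inner loop issues one memory access per input element instead of an extra read and write of the accumulator on every iteration.

```c
#include <stddef.h>

/* Memory-heavy version: the running sum lives in *result, and since
 * result may alias x or y, each iteration must read and write memory
 * for the accumulator as well as reading x[i] and y[i]. */
void dot_in_memory(const double *x, const double *y, size_t n, double *result)
{
    *result = 0.0;
    for (size_t i = 0; i < n; i++)
        *result += x[i] * y[i];
}

/* Register-friendly version: the accumulator is a local variable that
 * the compiler will normally keep in a register; memory is touched
 * only once per input element plus one final store by the caller. */
double dot_in_register(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```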

  20. Example: Matrix Multiplication (cache size = n). Pseudocode: for i = 0 to n-1: for j = 0 to n-1: A[i,j] ← 0; for k = 0 to n-1: A[i,j] ← A[i,j] + B[i,k] × C[k,j]. Energy = α n³ + β (n + n²) n + γ (3n²) ≈ (α + β) n³. Time = n³ + lower order terms.
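A minimal C rendering of the slide's triple loop, assuming row-major storage (the flat indexing is my own; the slide computes A = B × C with A as the result):

```c
/* Naive matrix multiply A = B * C for n x n matrices in row-major
 * storage.  With a cache of roughly n elements, each C[k,j] access in
 * the inner loop strides by n and misses, so the loop nest generates
 * on the order of n^3 memory-to-cache transfers. */
void matmul_naive(double *A, const double *B, const double *C, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            A[i * n + j] = 0.0;
            for (int k = 0; k < n; k++)
                A[i * n + j] += B[i * n + k] * C[k * n + j];
        }
    }
}
```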

  21. Optimization I: Reduce Bus Traffic. Block matrix multiply [figure: the n x n matrices partitioned into blocks]. With √n × √n blocks, which fit in the size-n cache, the bus traffic term drops from β n³ to 2β n^(5/2): Energy = α n³ + 2β (n · n^(1/2)) n + γ (3n²). Time = n³ + lower order terms.
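A hedged C sketch of the blocked version; the block size b is a parameter, and the slide's analysis corresponds to choosing b ≈ √n so that the active tiles fit in the size-n cache:

```c
/* Blocked (tiled) matrix multiply A = B * C for n x n row-major
 * matrices.  Each (ib, jb, kb) iteration works on b x b tiles, so the
 * active tiles can stay resident in cache and each element of B and C
 * is brought over the bus roughly n/b times instead of up to n times. */
void matmul_blocked(double *A, const double *B, const double *C, int n, int b)
{
    for (int i = 0; i < n * n; i++)
        A[i] = 0.0;

    for (int ib = 0; ib < n; ib += b)
        for (int jb = 0; jb < n; jb += b)
            for (int kb = 0; kb < n; kb += b)
                /* Multiply the (ib,kb) tile of B by the (kb,jb) tile of
                 * C and accumulate into the (ib,jb) tile of A. */
                for (int i = ib; i < ib + b && i < n; i++)
                    for (int k = kb; k < kb + b && k < n; k++) {
                        double bik = B[i * n + k];
                        for (int j = jb; j < jb + b && j < n; j++)
                            A[i * n + j] += bik * C[k * n + j];
                    }
}
```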

  22. Optimization II: Reduce Peak Bus Bandwidth [figure: n x n matrices A, B, C]. Data = 2n² words, Time = n³ operations, so the required bus data rate is proportional to (1/n) × the processor rate.

  23. Optimization III: Application Directed Data Layouts • Applications have different data access patterns: matrices accessed by rows, columns, diagonals, sub-squares; tree structures accessed along paths and sub-trees • “Naive” data layouts degrade performance: large working sets cause capacity misses, and improper alignment in memory causes conflict misses • [figure: a 4 x 4 matrix a_{0,0} ... a_{3,3} stored in row-major layout versus block layout, with elements mapped to pages 0 through 3] (an index-mapping sketch follows)
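To make the block layout concrete, a small C sketch of the index mapping (assumed parameters: an n x n matrix stored as b x b blocks, blocks ordered row-major, with b dividing n; this is illustrative, not code from the talk):

```c
#include <stddef.h>

/* Offset of element (i, j) in a row-major layout of an n x n matrix. */
static size_t rowmajor_index(size_t i, size_t j, size_t n)
{
    return i * n + j;
}

/* Offset of element (i, j) when the matrix is stored as b x b blocks,
 * blocks ordered row-major and elements row-major inside each block.
 * Choosing b so that one block fills a page (or a cache's worth of
 * lines) keeps a sub-square access inside a single page/block. */
static size_t block_index(size_t i, size_t j, size_t n, size_t b)
{
    size_t block_row = i / b, block_col = j / b;   /* which block       */
    size_t in_row    = i % b, in_col    = j % b;   /* position in block */
    size_t blocks_per_row = n / b;                 /* assumes b | n     */
    return (block_row * blocks_per_row + block_col) * b * b
         + in_row * b + in_col;
}
```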

  24. Cache Friendly Algorithms. Cache friendly means • high data reuse • low cache pollution • regular access patterns. Data layouts: • static data layouts (matrix multiply) • dynamic data layouts (FFT)

  25. Fast Fourier Transform. DFT via the Cooley-Tukey algorithm: • compute a DFT of size N = N1 * N2 • Step 1: compute N2 DFTs of size N1 • Step 2: multiply by twiddle factors • Step 3: compute N1 DFTs of size N2 • divide and conquer recursively (see the sketch below). Current approach (MIT FFTW): • determine the optimal factorization • perform low level optimizations for kernels • construct larger size FFTs from the kernels. Key assumption: all DFTs of the same size have the same execution time.
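The three Cooley-Tukey steps can be sketched in C. This is a hedged illustration, not FFTW code: the sub-DFTs are plain O(m²) loops standing in for optimized kernels, and the function names are my own.

```c
#include <complex.h>
#include <math.h>
#include <stdlib.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Naive O(m^2) DFT reading m points with stride istride and writing m
 * points with stride ostride; stands in for an optimized small kernel. */
static void dft(const double complex *in, int istride, int m,
                double complex *out, int ostride)
{
    for (int k = 0; k < m; k++) {
        double complex sum = 0;
        for (int j = 0; j < m; j++)
            sum += in[j * istride] * cexp(-2.0 * M_PI * I * j * k / m);
        out[k * ostride] = sum;
    }
}

/* One level of the Cooley-Tukey split for N = N1 * N2, following the
 * three steps on the slide; a real implementation recurses on the
 * sub-DFTs instead of calling the naive kernel. */
void fft_cooley_tukey(const double complex *x, int N1, int N2,
                      double complex *X)
{
    int N = N1 * N2;
    double complex *tmp = malloc((size_t)N * sizeof *tmp);
    if (!tmp)
        return;

    /* Step 1: N2 DFTs of size N1 over x[N2*n1 + n2] (stride N2). */
    for (int n2 = 0; n2 < N2; n2++)
        dft(x + n2, N2, N1, tmp + n2 * N1, 1);

    /* Step 2: multiply by the twiddle factors exp(-2*pi*i*n2*k1/N). */
    for (int n2 = 0; n2 < N2; n2++)
        for (int k1 = 0; k1 < N1; k1++)
            tmp[n2 * N1 + k1] *= cexp(-2.0 * M_PI * I * n2 * k1 / N);

    /* Step 3: N1 DFTs of size N2 across n2, writing X[k1 + N1*k2]. */
    for (int k1 = 0; k1 < N1; k1++)
        dft(tmp + k1, N1, N2, X + k1, N1);

    free(tmp);
}
```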

  26. Problem with the Current Approach. All N-point DFTs do not have the same cost: different factorizations produce data access patterns with different strides, and the stride affects execution time. [figure: execution time (µsec) of a 32-point FFT versus access stride 2^s; experimental results on a Sun Ultra 1, 167 MHz, L2 cache = 512 KB = 32 K points]

  27. Our Approach: reorganize the input data layout to change non-unit stride accesses into unit stride accesses. Dynamic data layout: perform the data reorganization during the computation [figure: N1 x N2 view of the data; N2-point FFTs, data reorganization, then N1-point FFTs].
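A hedged C sketch of the data reorganization idea (my own simplification, not the USC implementation): transposing the N1 x N2 view of the signal turns the stride-N2 column accesses needed by the size-N1 sub-FFTs into unit stride accesses, at the cost of one reorganization pass.

```c
#include <complex.h>

/* View a length N1*N2 signal as an N1 x N2 row-major matrix and write
 * its transpose (N2 x N1) into 'out'.  After the transpose, a size-N1
 * sub-DFT that previously had to read in[N2*n1 + n2] with stride N2
 * can instead read out[n2*N1 + n1] with unit stride, which is what the
 * dynamic data layout buys in exchange for the reorganization cost. */
void reorganize_transpose(const double complex *in, double complex *out,
                          int N1, int N2)
{
    for (int n1 = 0; n1 < N1; n1++)
        for (int n2 = 0; n2 < N2; n2++)
            out[n2 * N1 + n1] = in[n1 * N2 + n2];
}
```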

  28. Example: decomposition trees for a 1024 * 1024 point FFT. FFTW: 1611.125 ms; USC approach: 1039.6496 ms, a 54.96% improvement over the state-of-the-art FFTW package on a DEC Alpha.

  29. Other Techniques for Node Level Power Optimizations • Voltage and frequency scaling: f_max ∝ (V - V_t)/V (see the sketch below) • Power management (idle/sleep/active states) • Reduce precision; instruction energy (10^-8 Joules, Fujitsu Sparc '934): OR = 3.26, MUL = 3.26 • Clock gating
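As a rough illustration of why voltage and frequency scaling saves energy, a hedged C sketch using only the relations on this slide (dynamic energy per cycle ∝ V², f_max ∝ (V - V_t)/V); the voltage values are placeholders:

```c
#include <stdio.h>

/* Relative dynamic energy per cycle: proportional to V^2. */
static double rel_energy(double V) { return V * V; }

/* Relative maximum clock frequency: proportional to (V - Vt)/V. */
static double rel_fmax(double V, double Vt) { return (V - Vt) / V; }

int main(void)
{
    const double Vt = 0.4;              /* placeholder threshold voltage */
    const double V_hi = 1.8, V_lo = 1.2;

    /* Scaling the supply from V_hi to V_lo cuts energy per cycle
     * quadratically, while the achievable frequency drops only by the
     * ratio of the (V - Vt)/V terms. */
    double e_ratio = rel_energy(V_lo) / rel_energy(V_hi);
    double f_ratio = rel_fmax(V_lo, Vt) / rel_fmax(V_hi, Vt);

    printf("energy per cycle scales by %.2f, frequency by %.2f\n",
           e_ratio, f_ratio);
    return 0;
}
```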

  30. Current Work • Development and verification of the techniques proposed for power optimization • Existing simulators: SimplePower (based on the SimpleScalar architecture), JouleTrack (code length limitations) • Board level power measurements: Brutus evaluation board (SA-1100) • Build a functional level power simulator that is fast with an acceptable level of accuracy • Develop a multiprocessor power model

  31. Space Time Representation: compute A ⊗ B for N x N matrices, with cache size c [figure: the result partitioned into √c x √c blocks; result block (i,j) is computed from a row of A blocks and a column of B blocks] • Compute the results block by block • Schedule the blocks row-major (N²/c steps) • Data per step ∝ N√c • Operations per step ∝ Nc • Data reuse per step ∝ √c • Total traffic ∝ (N²/c) · N√c = N³/√c

  32. Theorem: a unidirectional space-time representation leads to cache friendly algorithms, and hence to energy efficient algorithms.

  33. Network Level Energy Optimization • Computation cost is much lower than communication cost; the radio interface consumes a large amount of power. Power consumed by a WINS sensor node: transmission (100 m) = 600 mW (at 100 kbits/sec), reception = 300 mW, processor (SA-1100) = 250 MIPS/Watt • Energy to transfer 32 bits over 100 m in a WINS sensor node = ((600 + 300) mW ÷ 100 kbits/s) × 32 = 288 × 10^-6 Joules • Energy to execute a 32-bit instruction on the SA-1100 processor = 1 ÷ 250 MIPS/Watt = 0.004 × 10^-6 Joules (see the worked calculation below) • Additional overhead for bits added for error correction • Retransmissions are frequent due to unreliable links (e.g. wireless)
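The communication-versus-computation comparison on this slide can be reproduced with a few lines of C; the constants are taken directly from the slide and the rest is unit conversion:

```c
#include <stdio.h>

int main(void)
{
    /* Figures from the slide for a WINS sensor node and an SA-1100. */
    const double tx_power_w = 0.600;   /* transmission over 100 m    */
    const double rx_power_w = 0.300;   /* reception                  */
    const double radio_bps  = 100e3;   /* 100 kbits/sec              */
    const double ips_per_w  = 250e6;   /* SA-1100: 250 MIPS/Watt     */

    /* Energy to move 32 bits over the radio link (sender + receiver). */
    double e_comm = (tx_power_w + rx_power_w) / radio_bps * 32.0;

    /* Energy to execute one 32-bit instruction on the processor. */
    double e_comp = 1.0 / ips_per_w;

    printf("communication: %.0f uJ per 32 bits\n", e_comm * 1e6);    /* 288   */
    printf("computation:   %.3f uJ per instruction\n", e_comp * 1e6);/* 0.004 */
    printf("ratio: %.0f x\n", e_comm / e_comp);                      /* 72000 */
    return 0;
}
```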
