Energy Efficient Adaptive Beamforming on Sensor Networks - Viktor K. Prasanna - PowerPoint PPT Presentation



SLIDE 1

Energy Efficient Adaptive Beamforming on Sensor Networks

Viktor K. Prasanna, Bhargava Gundala, Mitali Singh

Dept. of EE-Systems, University of Southern California
email: prasanna@usc.edu
http://ceng.usc.edu/~prasanna
http://pacman.usc.edu

SLIDE 2

Outline

  • Problem Definition
  • Computational Characteristics
  • Prior Solution
  • Power Optimizations
    • Sensor Node Level
    • Inter Node Level
  • Challenges/Discussion

SLIDE 3

Problem Scenario

Energy Constrained Network

[Figure: network of sensor nodes, some passive and some active]

SLIDE 4

Beamforming

Def: the technique that spatially filters the signals received from an array of sensors and estimates the spatial features of the sources.

Procedure:

  • 1. Passively and repeatedly sample acoustic propagation wave field signals
  • 2. Linearly combine the input data with a weight matrix to form a sonar beam for a particular direction of look

Adaptive Sonar Beamforming: for high SNR and high resolution, time-changing signal and noise properties are included in the derivation of the weights, making them adapt accordingly.
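The linear combination in step 2 can be sketched as a conventional (delay-and-sum) beamformer; the array geometry, look direction, and noise level below are illustrative assumptions, not values from the slides.

```python
import numpy as np

# Weight-application sketch for one direction of look: snapshots from
# N sensors are linearly combined with a steering-derived weight
# vector. All parameters here are hypothetical.
N = 8                      # sensors in a uniform linear array
d, wavelength = 0.5, 1.0   # element spacing, in wavelengths
theta = np.deg2rad(30)     # direction of look

# Narrowband steering vector for the look direction.
n = np.arange(N)
s = np.exp(-2j * np.pi * d / wavelength * n * np.sin(theta))

# Conventional weights: undo the phase delays, then average.
w = s / N

# One snapshot: a unit plane wave from theta plus sensor noise.
rng = np.random.default_rng(0)
x = s + 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

# Beam output = linear combination of the sensor samples.
y = np.conj(w) @ x
print(abs(y))   # close to 1: the look-direction signal passes at roughly unit gain
```

An adaptive beamformer differs only in how w is derived: the weights are recomputed as the signal and noise statistics change.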

SLIDE 5

Space Time Adaptive Processing

Target Detection

[Figure: each CPI (Coherent Processing Interval) is a data cube of N elements x M PRIs (Pulse Repetition Intervals) x L range gates]

SLIDE 6

MITRE RT_STAP Benchmark

[Figure: input data cube, N = 22 elements x M = 64 PRIs x L = 1920 range gates]

T_latency = 161.25 msec & T_period = 32.25 msec

Processing pipeline: Input Data -> Preprocessing Step 1 -> Preprocessing Step 2 -> Doppler Processing -> Weight Computation -> Weight Application

SLIDE 7

Input Data Cube

[Figure: input data cube, M = 64 PRIs x N = 22 elements x L = 1920 range gates]

SLIDE 8

Sonar Signal Processing

Adaptive Beamforming

[Figure: processing chain: per-element FFT (sampling rate = 10 Hz ~ 25 kHz) -> Conventional Beamforming (time-domain element space to beam space) -> FFT -> Adaptive Beamforming (frequency-domain beam space, 100 ~ 5000 beams, output rate = 1 Hz ~ 100 Hz)]

SLIDE 9

An Example Adaptive Beamformer

MVDR (Minimum Variance Distortionless Response)

[Figure: per frequency bin (F bins), the N-channel FFT output feeds covariance estimation and factorization; a linear solver combines the factored covariance with the steering vectors to produce the beamformer weights, B beams per bin]
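The "Linear Solver" box corresponds to the standard MVDR closed form w = R^-1 s / (s^H R^-1 s). A minimal NumPy sketch for one frequency bin, with illustrative channel and snapshot counts:

```python
import numpy as np

# MVDR weights for a single frequency bin: R is the sample covariance
# of the N channels, s the steering vector. Sizes are illustrative.
rng = np.random.default_rng(1)
N, K = 4, 256     # channels, snapshots used for covariance estimation

# Simulated frequency-domain snapshots (noise only, for the sketch).
X = rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))

# Sample covariance with diagonal loading for numerical stability.
R = X @ X.conj().T / K + 1e-3 * np.eye(N)

s = np.ones(N, dtype=complex)      # steering vector (broadside look)

# Solve R u = s rather than forming R^-1 explicitly; a Cholesky
# factorization of R (the "Factorization" box) is the usual route.
u = np.linalg.solve(R, s)
w = u / (s.conj() @ u)

# Distortionless constraint: unit response in the look direction.
print(s.conj() @ w)    # approximately 1 + 0j
```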

SLIDE 10

Computational Characteristics

[Figure: initial data layout: the data distributed across sensors S1-S4, with outputs produced per subproblem]

  • Overall processing consists of a sequence of subproblems
  • Computational requirements are different for each subproblem
  • A large amount of data is repeatedly processed in real time
  • Data access patterns change from subproblem to subproblem
  • Throughput and latency performance requirements

SLIDE 11

Adaptive Processing

[Figure: input data cube, M = 64 PRIs x N = 22 elements x L = 1920 range gates]

Key Problems

  • Doppler Processing (FFT)
  • Weight Computation (covariance matrix factorization; the adaptation step)
  • Weight Application (matrix-vector product; the apply step)

SLIDE 12

Prior Solution

Architecture: a tightly coupled collection of processors

High bandwidth, low latency network

Target detection

SLIDE 13

Key Issue: Communication Cost

Coarse grain machines: powerful processing nodes

  • SP-2: Typical Configuration
    • 640 Mflops/node
    • 64 MB – 4 GB Memory
    • 4.5 – 36.2 GB Internal Disk
  • T3E: Typical Configuration
    • 1200 Mflops/node (T3E-1200)
    • Local Memory Access Time: 87 ~ 253 nsec
    • Global Memory Access Time: 1 ~ 2 µsec (SHMEM)

Large software overhead for message transfer

  • SP-2: ~39 µsec overhead/message using MPL/MPI, ~9 nsec/byte/node transfer rate
  • local memory access: 100's of nsec

SLIDE 14

Key Idea- Data Remapping

[Figure: data access patterns of subproblems S1, S2, S3 over processors P0 ... P3, with a possible remap between consecutive subproblems]

Benefits of Remapping Must Exceed the Overhead

SLIDE 15

Impact of Data Remapping

Implementation performed on the IBM SP-2 at MHPCC
Code developed using C, MPI, and ESSL

[Chart: our results vs. results reported in IPPS '95]

SLIDE 16

Lessons learnt

Objective: adaptive beamforming on parallel machines

  • Task level parallelism
  • Minimize communication cost
  • Data Remapping

SLIDE 17

Energy Efficiency

  • Power is critical and must be conserved
  • Reduce power dissipation at the sensor node level: energy efficient algorithms
  • Decrease power dissipation at the inter-node level: optimize the communication cost between sensors

[Figure: energy-constrained network of sensors]

SLIDE 18

Power Model for a Processing Element

Power_Total = Power_Processor + Power_DataBus + Power_Memory

Power_unit = Power_Dynamic + Power_Static = 0.5 f(n) C V² f_Active + V I_Leakage

F_max ∝ (V - V_t)/V

[Figure: processor (FU + cache) and memory on a shared data bus, with separate frequency controls f_p (processor) and f_b (bus)]
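The power model above can be exercised numerically; the activity factor, capacitance, voltage, clock rate, and leakage current below are purely illustrative, chosen only to show the relative magnitudes.

```python
# Sketch of the slide's power model. All numeric values here are
# hypothetical, not measurements.
def unit_power(a, C, V, f_active, I_leak):
    dynamic = 0.5 * a * C * V**2 * f_active   # switching power
    static = V * I_leak                       # leakage power
    return dynamic + static

# a = 0.3, C = 1 nF, V = 1.5 V, f = 200 MHz, I_leak = 1 mA
p = unit_power(0.3, 1e-9, 1.5, 200e6, 1e-3)
print(round(p, 4), "W")   # 0.069 W (0.0675 dynamic + 0.0015 static)

# Lowering V cuts dynamic power quadratically, but the achievable
# clock scales as F_max proportional to (V - Vt)/V, so frequency
# must drop too.
def f_max(V, Vt, k=1.0):
    return k * (V - Vt) / V

assert f_max(1.2, 0.4) < f_max(1.8, 0.4)   # lower V, lower max clock
```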

SLIDE 19

Reduce Processor-Memory Data Traffic

Instructions for memory access consume a lot of power

Reduce # of memory accesses: reduce cache misses

  • high data reuse in cache
  • use registers

Reduce power consumed on the data bus

Instruction (Intel 486DX2)    Energy (10⁻⁸ Joules)
MOV [BX], DX                  4.30
MOV DX, [BX]                  3.53
MOV DX, BX                    2.49

SLIDE 20

Example: Matrix Multiplication

Do i = 0, n-1
  Do j = 0, n-1
    A[i,j] = 0
    Do k = 0, n-1
      A[i,j] = A[i,j] + B[i,k] x C[k,j]

Energy = αn³ + β(n + n²)n + γ(3n²) ≈ (α + β)n³

Time = n³ + lower order terms

[Figure: access patterns: A by (i,j), B by (i,k), C by (k,j); cache size = n]

SLIDE 21

Optimization I: Reduce Bus Traffic

Block Matrix Multiply

Energy = αn³ + 2β(n·n^(1/2))n + γ(3n²)

Time = n³ + lower order terms

[Figure: blocked n x n matrix multiply with √n x √n blocks]
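The blocking idea can be sketched directly; the tile size b plays the role of √n, and the matrix size here is illustrative.

```python
import numpy as np

# Blocked (tiled) matrix multiply: operating on b x b tiles that fit
# in cache raises data reuse, which is what cuts the bus-traffic term
# in the energy model.
def blocked_matmul(B, C, b):
    n = B.shape[0]
    A = np.zeros((n, n))
    for i0 in range(0, n, b):           # tile row of A
        for j0 in range(0, n, b):       # tile column of A
            for k0 in range(0, n, b):   # accumulate over tiles
                A[i0:i0+b, j0:j0+b] += (
                    B[i0:i0+b, k0:k0+b] @ C[k0:k0+b, j0:j0+b]
                )
    return A

rng = np.random.default_rng(2)
B = rng.standard_normal((8, 8))
C = rng.standard_normal((8, 8))
assert np.allclose(blocked_matmul(B, C, 4), B @ C)   # same result
```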

SLIDE 22

Optimization II: Reduce Peak Bus Bandwidth

[Figure: pipelined schedule overlapping the transfer of blocks of A, B, and C with computation, so the peak bus bandwidth requirement stays proportional to the processor rate]

Bus Data Rate ∝ Processor Rate!

SLIDE 23

Optimization III: Application directed Data Layouts

  • Applications have different data access patterns
    • Matrices accessed by rows, columns, diagonals, sub-squares
    • Tree structures accessed along paths, sub-trees
  • "Naive" data layouts degrade performance
    • Large working sets cause capacity misses
    • Improper alignment in memory causes conflict misses

[Figure: a 4 x 4 matrix a_{0,0} ... a_{3,3} mapped onto Pages 0-3 under a row-major layout vs. a block layout]
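The two layouts in the figure can be written as explicit address maps; this indexing sketch assumes pages of n elements with n a perfect square, and n = 4 as in the figure.

```python
import math

# Row-major layout: a[i][j] lives at linear address i*n + j.
def row_major(i, j, n):
    return i * n + j

# Block layout: each sqrt(n) x sqrt(n) sub-square is stored
# contiguously, so one sub-square fills exactly one page of n elements.
def block_major(i, j, n):
    b = math.isqrt(n)                  # block edge
    block = (i // b) * b + (j // b)    # which sub-square (page)
    offset = (i % b) * b + (j % b)     # position inside the sub-square
    return block * n + offset

n = 4
# a[1][1] lands on page 1 row-major but page 0 under the block layout,
# so a 2x2 working set touches one page instead of two.
print(row_major(1, 1, n) // n, block_major(1, 1, n) // n)   # 1 0
```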

SLIDE 24

Cache Friendly Algorithms

Cache friendly:
  • High data reuse
  • Low cache pollution
  • Regular access patterns

Data layouts:
  • Static data layouts (Matrix Multiply)
  • Dynamic data layouts (FFT)

SLIDE 25

Fast Fourier Transform

DFT: Cooley-Tukey Algorithm

Compute a DFT of size N = N1 x N2:

  • Step 1: compute N2 DFTs of size N1
  • Step 2: multiply by twiddle factors
  • Step 3: compute N1 DFTs of size N2

Divide and conquer recursively
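The three steps can be sketched with NumPy, using np.fft.fft as the base-case DFT kernel (the recursion bottoms out there rather than in hand-optimized kernels):

```python
import numpy as np

# One Cooley-Tukey decomposition step for N = N1 * N2, with the input
# viewed as an N1 x N2 matrix: x[n1*N2 + n2] = A[n1, n2] and the
# output indexed as X[k1 + N1*k2].
def cooley_tukey(x, N1, N2):
    N = N1 * N2
    A = x.reshape(N1, N2)
    # Step 1: N2 DFTs of size N1 (down the columns, stride-N2 data).
    B = np.fft.fft(A, axis=0)
    # Step 2: multiply twiddle factors W_N^(k1*n2).
    k1 = np.arange(N1)[:, None]
    n2 = np.arange(N2)[None, :]
    B = B * np.exp(-2j * np.pi * k1 * n2 / N)
    # Step 3: N1 DFTs of size N2 (along the rows).
    C = np.fft.fft(B, axis=1)
    # Reorder so the flat output index is k1 + N1*k2.
    return C.T.reshape(N)

x = np.random.default_rng(3).standard_normal(12)
assert np.allclose(cooley_tukey(x, 3, 4), np.fft.fft(x))   # matches a direct FFT
```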

Current Approach

MIT FFTW

  • Determine optimal factorization
  • Perform low level optimizations for kernels
  • Construct larger size FFTs from kernels

Key Assumption

All DFTs of the same size have the same execution time

SLIDE 26

Problem with Current Approach

All N-point DFTs do not have the same cost! Different factorizations lead to different data access patterns with various strides, and stride affects execution time.

Sun Ultra 1: 167MHz, L2 Cache = 512 KB = 32 K points

32-point FFT with Strided Access - Experimental Results

N = 32

[Plot: execution time (µsec) vs. stride 2^s]

SLIDE 27

Our Approach

  • Reorganize the input data layout to change non-unit-stride access to unit stride
  • Dynamic Data Layout: perform data reorganization during computation

[Figure: N2 N1-point FFTs -> data reorganization -> N1 N2-point FFTs]
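The reorganization step can be illustrated with array strides; the sizes N1 and N2 below are arbitrary.

```python
import numpy as np

# Dynamic data layout sketch: a column of the N1 x N2 view is a
# stride-N2 access; an explicit transpose copy mid-computation turns
# the same elements into a unit-stride row.
N1, N2 = 4, 8
x = np.arange(N1 * N2, dtype=np.float64)

A = x.reshape(N1, N2)
col = A[:, 3]                      # strided view of one column
assert col.strides[0] == N2 * x.itemsize   # stride of N2 elements

At = np.ascontiguousarray(A.T)     # the data reorganization (a copy)
row = At[3, :]                     # same values, now unit stride
assert row.strides[0] == x.itemsize
assert (col == row).all()
```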

SLIDE 28

Example

FFTW: 1611.125 ms
USC approach: 1039.6496 ms

54.96% improvement over state-of-the-art FFTW package on DEC Alpha

Decomposition trees for a 1024*1024 point FFT

SLIDE 29

Other Techniques for Node Level Power Optimizations ?

  • Voltage/frequency scaling: f_max ∝ (V - V_t)/V
  • Power management (idle/sleep/active states)
  • Reduce precision
  • Clock gating

Instruction (Fujitsu SPARC '934)    Energy (10⁻⁸ Joules)
MUL                                 3.26
OR                                  3.26

SLIDE 30

Current Work

Development and verification of the techniques proposed for power optimization

Existing simulators:

  • SimplePower (based on the SimpleScalar architecture)
  • JouleTrack (code length limitations)

Board level power measurements:

  • Brutus evaluation board (SA-1100)

Build a functional level power simulator:

  • Fast, with an acceptable level of accuracy
  • Develop a multiprocessor power model

SLIDE 31

Space Time Representation

A ⊗ B for N x N matrices, c = cache size

[Figure: space-time view of blocked matrix multiply: results computed in √c x √c blocks, blocks scheduled row-major]

  • Compute results in each block; schedule blocks row-major
  • N²/c steps
  • Data per step ∝ N√c
  • Operations per step ∝ Nc
  • Data reuse per step ∝ √c
  • Total traffic ∝ (N²/c) · N√c = N³/√c

SLIDE 32

Theorem: A unidirectional space-time representation leads to cache friendly algorithms => energy efficient algorithms

SLIDE 33

Network level Energy Optimization

  • Computation cost is much lower than communication cost
  • The radio interface consumes a large amount of power
  • Energy to transfer 32 bits over 100 m in a WINS sensor node = ((600 + 300) mW ÷ 100 kbits/s) x 32 = 288 x 10⁻⁶ Joules
  • Energy to execute a 32-bit instruction on an SA-1100 processor = 1 ÷ 250 MIPS/watt = 0.004 x 10⁻⁶ Joules
  • Additional overhead for bits added for error correction
  • Retransmissions are frequent due to unreliable links (e.g. wireless)

WINS sensor node, power consumed:
  • Transmission (100 m, at 100 kbits/sec): 600 mW
  • Reception: 300 mW
  • Processor (SA-1100): 250 MIPS/watt
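The two energy figures on the slide follow directly from these numbers; reproducing the arithmetic:

```python
# Energy arithmetic from the slide's WINS / SA-1100 figures.
radio_power_w = 0.600 + 0.300    # transmit + receive power, watts
radio_rate = 100e3               # radio data rate, bits per second

comm_energy = radio_power_w / radio_rate * 32   # joules per 32 bits
comp_energy = 1 / 250e6          # joules per instruction (250 MIPS/watt)

print(comm_energy)                # 0.000288 J = 288 x 10^-6 J
print(comp_energy)                # 4e-09 J = 0.004 x 10^-6 J
print(comm_energy / comp_energy)  # sending 32 bits costs ~72,000 instructions
```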

SLIDE 34

Reduce Communication Cost

  • Exploit data redundancy to reduce data traffic
  • Improve locality of computation while assigning subtasks to nodes
  • Communication limited to closely placed nodes
    • Larger distance requires higher transmission power
    • Larger distance reduces reliability of the link

SLIDE 35

Network Level Power Optimization Issues

  • Topology of the network is unknown
    • Estimation of communication cost
    • Task allocation
  • Broadcast communication model

Need: a framework for energy efficient computation in ad hoc networks