SLIDE 1

Real-Time High-Throughput Sonar Beamforming Kernels Using Native Signal Processing and Memory Latency Hiding Techniques

Gregory E. Allen, Brian L. Evans, Lizy K. John

Department of Electrical and Computer Engineering The University of Texas at Austin

http://www.ece.utexas.edu/~allen/

SLIDE 2

Introduction

  • Sonar beamforming is computationally intensive

  • GFLOPS of computation
  • 100 MB/s of data input/output
  • Current real-time implementation technologies
  • Custom hardware
  • Custom integration using commercial-off-the-shelf (COTS) processors (e.g. 100 digital signal processors in a VME chassis)

  • Low production volume (50 units), high development cost
  • Examine performance of commodity computers
  • Native signal processing, multimedia instruction sets
  • Memory latency hiding techniques
SLIDE 3

Native Signal Processing

  • Single-cycle multiply-accumulate (MAC) operation
  • Vector dot products, digital filters, and correlation
  • Missing extended precision accumulation
  • Single-instruction multiple-data (SIMD) processing
  • UltraSPARC Visual Instruction Set (VIS) and Pentium MMX: 64-bit registers, 8-bit and 16-bit fixed-point arithmetic

  • Pentium III, K6-2 3DNow!: 64-bit registers, 32-bit floating-point
  • PowerPC AltiVec: 128-bit registers, 4x32-bit floating-point MACs
  • Must hand-code using intrinsics and assembly code

$$\sum_{i=1}^{N} \alpha_i x_i$$
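The dot product above is a chain of multiply-accumulate operations, and the bullet on extended precision accumulation can be made concrete with a small sketch. The function names `dot_q15_wide` and `dot_q15_wrap32` are hypothetical, and the 64-bit accumulator only stands in for the guard bits of a DSP accumulator; this is not code from the presentation.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of 16-bit fixed-point vectors, one MAC per element.
 * The 64-bit accumulator stands in for an extended precision
 * accumulator (hypothetical helper, for illustration only). */
int64_t dot_q15_wide(const int16_t *alpha, const int16_t *x, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* 16x16-bit multiply always fits in 32 bits */
        acc += (int32_t)alpha[i] * (int32_t)x[i];
    }
    return acc;
}

/* Same sum kept in 32 bits. Unsigned arithmetic is used so that the
 * wraparound modulo 2^32 is well defined in C. */
uint32_t dot_q15_wrap32(const int16_t *alpha, const int16_t *x, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (uint32_t)((int32_t)alpha[i] * (int32_t)x[i]);
    }
    return acc;
}
```

With eight full-scale products the wide accumulator holds the true sum, while the 32-bit version keeps only its low 32 bits.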

SLIDE 4

Visual Instruction Set

  • 50 new CPU instructions for UltraSPARC

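[Figure: vis_fpadd16 operating on vis_d64 operands. Each 64-bit register holds four 16-bit fields (bits 63-48, 47-32, 31-16, 15-0). Inputs (A1, A2, A3, A4) and (B1, B2, B3, B4) are added field by field to give (A1+B1, A2+B2, A3+B3, A4+B4).]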

  • Inline function library provided for use from C/C++
  • Independent operation on each data cell (SIMD)
  • Optimized for video and image processing
  • Partitioned data types in 32-bit or 64-bit FP registers
  • Includes arithmetic and logic, packing and unpacking, alignment and data conversion, etc.
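The idea of independent operation on each data cell can be sketched in portable C without any VIS hardware. The helpers `padd16` and `pack16` below are hypothetical names, not part of the VIS inline library, and the wraparound behavior on overflow is an assumption chosen for this sketch.

```c
#include <stdint.h>

/* Pack four 16-bit fields into one 64-bit word, first argument in
 * bits 63-48 (hypothetical helper for this sketch). */
uint64_t pack16(uint16_t f3, uint16_t f2, uint16_t f1, uint16_t f0)
{
    return ((uint64_t)f3 << 48) | ((uint64_t)f2 << 32) |
           ((uint64_t)f1 << 16) | (uint64_t)f0;
}

/* Partitioned add of four 16-bit fields held in 64-bit words.
 * Each field wraps on its own and never carries into its neighbor.
 * This emulates the idea in plain C; it is not the VIS library. */
uint64_t padd16(uint64_t a, uint64_t b)
{
    const uint64_t H = 0x8000800080008000ULL; /* top bit of each field */
    /* Add the low 15 bits of every field; carries stop at bit 15. */
    uint64_t low = (a & ~H) + (b & ~H);
    /* Fold the top bits back in with XOR, which cannot carry. */
    return low ^ ((a ^ b) & H);
}
```

A single 64-bit add replaces four 16-bit adds, which is the source of the SIMD speedup; dedicated instructions additionally avoid the masking work done here.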

SLIDE 5

Memory Latency Hiding

  • Fast processor stalls when accessing slow memory
  • Cache memories can help to alleviate this problem
  • High-throughput streams of data amplify this problem
  • Software techniques can reduce the penalty

  • Technique: Loop unrolling
  • Enlarges basic block size and reduces looping overhead
  • Can increase the time between data request and consumption
  • Low risk and no overhead, commonly used by compilers
  • Technique: Software pipelining
  • Data loads and their use are overlapped across different loop iterations
  • Increases register usage and lifetimes, hard for compiler
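The two techniques above can be sketched on a plain dot product. The function names are hypothetical and the unroll factor of 4 is an arbitrary choice; the kernels in the presentation are more elaborate.

```c
#include <stddef.h>

/* Baseline: one load pair and one MAC per iteration. */
float dot_simple(const float *a, const float *x, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * x[i];
    return sum;
}

/* Loop unrolling by 4 with independent accumulators: larger basic
 * block, fewer branches, more loads in flight before first use. */
float dot_unrolled4(const float *a, const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * x[i];
        s1 += a[i + 1] * x[i + 1];
        s2 += a[i + 2] * x[i + 2];
        s3 += a[i + 3] * x[i + 3];
    }
    for (; i < n; i++)          /* remainder iterations */
        s0 += a[i] * x[i];
    return (s0 + s1) + (s2 + s3);
}

/* Software pipelining: the loads for iteration i+1 are issued before
 * the MAC of iteration i, at the cost of extra live registers. */
float dot_pipelined(const float *a, const float *x, size_t n)
{
    if (n == 0)
        return 0.0f;
    float sum = 0.0f;
    float a_cur = a[0], x_cur = x[0];          /* prologue */
    for (size_t i = 0; i + 1 < n; i++) {
        float a_next = a[i + 1], x_next = x[i + 1];
        sum += a_cur * x_cur;
        a_cur = a_next;
        x_cur = x_next;
    }
    return sum + a_cur * x_cur;                /* epilogue */
}
```

All three variants compute the same sum; only the schedule of loads and arithmetic differs. Note that the unrolled version reassociates the floating-point additions, so results can differ in the last bits for general data.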
SLIDE 6

Software Data Prefetching

  • Non-blocking prefetch CPU instruction
  • Issued at some time prior to when data is needed
  • Data at effective address is brought into cache
  • At a later load instruction, the data is already cached
  • Can be generated by a compiler
  • Implemented in the UltraSPARC-II CPU

  • Problems: overhead and “prefetch distance”
  • Uses extra cache and issues extra instructions
  • Prefetch too far ahead: excessive cache usage, spillage
  • Not far enough ahead: stall at load instruction
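A minimal sketch of software prefetching and the prefetch distance trade-off follows. It uses the `__builtin_prefetch` builtin of GCC and Clang rather than the assembly macros described in the presentation, and the constants `LINE_FLOATS` and `PREFETCH_AHEAD` are hypothetical tuning values, not figures from the slides.

```c
#include <stddef.h>

enum {
    LINE_FLOATS    = 16,              /* assumes a 64-byte cache line */
    PREFETCH_AHEAD = 4 * LINE_FLOATS  /* prefetch distance, to be tuned */
};

/* Dot product that requests data PREFETCH_AHEAD elements before it is
 * used. One prefetch per assumed cache line limits the instruction
 * overhead. Requires GCC or Clang for __builtin_prefetch. */
float dot_prefetch(const float *a, const float *x, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i % LINE_FLOATS == 0 && i + PREFETCH_AHEAD < n) {
            /* arguments: address, 0 = read, 0 = low temporal locality */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 0);
            __builtin_prefetch(&x[i + PREFETCH_AHEAD], 0, 0);
        }
        sum += a[i] * x[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same with or without it. Raising `PREFETCH_AHEAD` risks evicting data before it is used, and lowering it risks the load stalling anyway.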
SLIDE 7

Sonar Beamforming

  • Typically the computational bottleneck in sonar
  • High throughput streams of data
  • Goal: best performance using any means
  • We evaluate two key kernels for 3-D beamforming
SLIDE 8

Time-Domain Beamforming

$$b(t) = \sum_{i=1}^{M} \alpha_i \, x_i(t - \tau_i)$$

where b(t) is the beam output, x_i(t) is the ith sensor output, τ_i is the ith sensor delay, and α_i is the ith sensor weight.

  • Delay-and-sum weighted sensor outputs
  • Geometrically project the sensor elements onto a line to compute the time delays
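Delay-and-sum beamforming with projected delays can be sketched as below, assuming whole-sample delays and a row-major sensor buffer. The names `projected_delay` and `delay_and_sum`, the unit-vector parameterization of the beam direction, and the sign convention of the delay are all assumptions made for illustration.

```c
#include <stddef.h>

/* Time delay in seconds for a sensor at (px, py): its position is
 * projected onto the unit vector (ux, uy) of the beam direction and
 * divided by the propagation speed c. Sign convention is assumed. */
double projected_delay(double px, double py, double ux, double uy, double c)
{
    return (px * ux + py * uy) / c;
}

/* b[n] = sum over i of alpha[i] * x_i[n - delay[i]].
 * x holds m sensors of len samples each, sensor i at x[i * len].
 * Samples before the start of the record are treated as zero. */
void delay_and_sum(const float *x, size_t m, size_t len,
                   const float *alpha, const size_t *delay, float *b)
{
    for (size_t n = 0; n < len; n++) {
        float acc = 0.0f;
        for (size_t i = 0; i < m; i++) {
            if (n >= delay[i])
                acc += alpha[i] * x[i * len + (n - delay[i])];
        }
        b[n] = acc;
    }
}
```

In practice the delay in seconds would be multiplied by the sample rate and rounded or interpolated, which is where delay resolution becomes a concern.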

[Figure: Projection for a beam pointing 20° off axis. Axes in inches (x position on the horizontal axis); markers distinguish sensor elements from projected elements.]

SLIDE 9
Horizontal Beamformer

  • Sample at just above the Nyquist rate, interpolate to obtain desired time delay resolution

[Figure: Digital interpolation beamformer producing a single beam output b[n]. Stave data arrives at interval ∆ and is interpolated up to interval δ = ∆/L; each of the M staves is then delayed in steps of δ (z^-N_1 through z^-N_M), weighted by α_1 through α_M, and summed (Σ).]

  • Modeled as a sparse FIR filter
  • Forming 61 beams from 80 elements with 2-point interpolation
  • 3000 index lookups plus 6000 floating-point MACs per sample
  • At each sample: 12 Kbytes of data, coefficient size of 36 Kbytes
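One way to read the digital interpolation beamformer as a sparse FIR filter is sketched below: each element contributes one index lookup and two MACs per beam output, matching the 1:2 ratio of lookups to MACs quoted on this slide. The `Tap` structure, the helper names, and the use of linear interpolation for the 2-point case are assumptions, not details taken from the presentation.

```c
#include <stddef.h>

/* One sparse filter tap per element (hypothetical layout). */
typedef struct {
    size_t delay;  /* whole-sample delay: the index lookup */
    float  w0;     /* weight on x[n - delay]     */
    float  w1;     /* weight on x[n - delay - 1] */
} Tap;

/* Fold the sensor weight alpha and a fractional delay frac in [0, 1)
 * into two coefficients, assuming linear 2-point interpolation. */
Tap make_tap(float alpha, size_t delay, float frac)
{
    Tap t = { delay, alpha * (1.0f - frac), alpha * frac };
    return t;
}

/* Samples before the start of the record are treated as zero. */
static float sample_or_zero(const float *xi, size_t n, size_t back)
{
    return back <= n ? xi[n - back] : 0.0f;
}

/* One beam output at time n: two MACs per element.
 * x holds m elements of len samples each, element i at x[i * len]. */
float beam_sample(const float *x, size_t len, size_t n,
                  const Tap *taps, size_t m)
{
    float acc = 0.0f;
    for (size_t i = 0; i < m; i++) {
        const float *xi = x + i * len;
        acc += taps[i].w0 * sample_or_zero(xi, n, taps[i].delay);
        acc += taps[i].w1 * sample_or_zero(xi, n, taps[i].delay + 1);
    }
    return acc;
}
```

Because the tap table is indexed per beam and per element, the inner loop mixes irregular loads with arithmetic, which is why memory latency matters for this kernel.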

SLIDE 10

Vertical Beamformer

[Figure: Multiple vertical transducers for every horizontal position (stave)]

  • Vertical columns combined into 3 stave outputs
  • Multiple dot products (30 MACs per stave per sample)
  • Convert integer to floating-point for following stages
  • Ideal candidate for the Visual Instruction Set (VIS)
  • Use integer dot products (16x16-bit multiply, 32-bit add)
  • Highest precision (and slowest) VIS mode
  • Coefficients must be scaled for best dynamic range
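The integer dot product and coefficient scaling described above can be sketched in scalar C. The Q15 scaling, the function names `to_q15` and `vertical_dot`, and the requirement that the coefficient magnitudes sum to at most 1 are assumptions for this sketch; the VIS version would process several 16-bit fields per instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Scale a real coefficient in [-1, 1) to 16-bit Q15 with rounding and
 * saturation, so the full 16-bit dynamic range is used. */
int16_t to_q15(float c)
{
    float s = c * 32768.0f;
    if (s > 32767.0f)  s = 32767.0f;
    if (s < -32768.0f) s = -32768.0f;
    return (int16_t)(s < 0.0f ? s - 0.5f : s + 0.5f);
}

/* Dot product of one vertical column with Q15 coefficients:
 * 16x16-bit multiplies, 32-bit adds, then conversion to float for the
 * following stages. Assumes the coefficient magnitudes sum to at most
 * 1 so that the 32-bit accumulator cannot overflow. */
float vertical_dot(const int16_t *col, const int16_t *coef_q15, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)col[i] * (int32_t)coef_q15[i];
    return (float)acc / 32768.0f;   /* undo the Q15 scaling */
}
```

For example, samples {100, 200, -300} with coefficients {0.5, 0.25, -0.5} give 50 + 50 + 150 = 250 after the final scaling.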

SLIDE 11

Tools Utilized

  • Sun’s SPARCompiler 5.0
  • Automated prefetch instruction generation?
  • Inline assembly macros for VIS instructions
  • Wrote assembly macros for prefetch and fitos instructions
  • Shade: pficount (prefetch instruction counter)
  • INCAS (It’s a Nearly Cycle-Accurate Simulator)
  • perf-monitor: hardware performance counters
  • Benchmarks on a 336 MHz UltraSPARC-II

SLIDE 12

Horizontal Kernel Performance

  • Hand loop unrolling gives speedup of 2.4
  • Multiple passes improve cache usage (93% / 97%)
  • Inline PREFETCH “breaks” compiler optimization

[Figure: Horizontal kernel performance (vertical axis 150 to 450) versus outer loop unrolling (1 to 7) for three variants: multiple pass, single pass, and inline PREFETCH. Maximum: 1.32 FLOPC, 2.19 IPC, 444 MFLOPS, 66% of peak.]

SLIDE 13

Vertical Kernel Performance

  • VIS offers a 46% boost over floating-point
  • Software prefetching gives an additional 34%
  • 104 MB/s data input, 62.7 MB/s data output

[Figure: Bar chart of MFLOPS (or MIOPS), vertical axis 50 to 350, for nine kernel variants: (1) floating point; (2) floating point in asm; (3) floating point, VIS loading; (4) int (no VIS); (5) VIS baseline; (6) VIS, unrolled inner loop; (7) VIS, add double-loading; (8) VIS, reschedule and pipeline; (9) VIS, add software prefetching. Best result: 0.93 IOPC, 1.41 IPC, 313 MIOPS, 93% of peak.]

SLIDE 14

Vertical Prefetch Statistics

  • Breakdown of execution time
  • Execution cycles (no stall) constant across trials
  • Internal cache statistics do not change

[Figure: Bar chart of execution time in seconds (axis 0.5 to 3) for four trials: no prefetching, write prefetching only, read prefetching only, and read/write prefetching. Each bar is broken down into no stall, load stall, and store stall time.]

SLIDE 15

Conclusion

  • Beamforming kernel results:
  • Horizontal beamformer kernel: 444 MFLOPS, 66% of peak
  • Vertical beamformer kernel: 313 MIOPS, 93% of peak
  • Loop unrolling: 2.4 speedup in horizontal kernel
  • VIS: 1.46 speedup in vertical kernel
  • Prefetching: 1.34 speedup in vertical kernel
  • Near-peak performance can be achieved, but
  • Kernel optimization is difficult and time consuming
  • Compiler did not generate prefetch instructions
  • For high-throughput real-time signal processing, general purpose CPUs can be an attractive target
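
The headline figures can be cross-checked against the 336 MHz clock of the benchmark machine. The peak rates used here (2 floating-point operations per cycle for the horizontal kernel, 1 integer operation per cycle for the vertical kernel) are not stated in the slides; they are inferred from the reported per-cycle rates and percentages. The last line assumes the VIS and prefetching gains compose multiplicatively.

```latex
\begin{gather*}
\frac{444\ \text{MFLOPS}}{336\ \text{MHz}} \approx 1.32\ \text{FLOPC},
\qquad \frac{1.32}{2} \approx 66\%\ \text{of peak} \\
\frac{313\ \text{MIOPS}}{336\ \text{MHz}} \approx 0.93\ \text{IOPC},
\qquad \frac{0.93}{1} = 93\%\ \text{of peak} \\
1.46 \times 1.34 \approx 1.96\ \text{combined speedup in the vertical kernel}
\end{gather*}
```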