Real-Time High-Throughput Sonar Beamforming Kernels Using Native Signal Processing and Memory Latency Hiding Techniques


  1. Real-Time High-Throughput Sonar Beamforming Kernels Using Native Signal Processing and Memory Latency Hiding Techniques
  Gregory E. Allen, Brian L. Evans, Lizy K. John
  Department of Electrical and Computer Engineering, The University of Texas at Austin
  http://www.ece.utexas.edu/~allen/

  2. Introduction
  • Sonar beamforming is computationally intensive
    • GFLOPS of computation
    • 100 MB/s of data input/output
  • Current real-time implementation technologies
    • Custom hardware
    • Custom integration using commercial-off-the-shelf (COTS) processors (e.g., 100 digital signal processors in a VME chassis)
    • Low production volume (50 units), high development cost
  • Examine the performance of commodity computers
    • Native signal processing, multimedia instruction sets
    • Memory latency hiding techniques

  3. Native Signal Processing
  • Single-cycle multiply-accumulate (MAC) operation
    • Vector dot products, digital filters, and correlation: Σ_{i=1}^{N} α_i x_i
    • Missing extended-precision accumulation
  • Single-instruction multiple-data (SIMD) processing
    • UltraSPARC Visual Instruction Set (VIS) and Pentium MMX: 64-bit registers, 8-bit and 16-bit fixed-point arithmetic
    • Pentium III, K6-2 3DNow!: 64-bit registers, 32-bit floating point
    • PowerPC AltiVec: 128-bit registers, 4x32-bit floating-point MACs
  • Must hand-code using intrinsics and assembly code

  4. Visual Instruction Set
  • 50 new CPU instructions for the UltraSPARC
    • Optimized for video and image processing
    • Partitioned data types in 32-bit or 64-bit FP registers
    • Includes arithmetic and logic, packing and unpacking, alignment and data conversion, etc.
  • Independent operation on each data cell (SIMD)
    [Figure: vis_fpadd16 adds two 64-bit registers as four independent 16-bit lanes, producing A1+B1, A2+B2, A3+B3, A4+B4]
  • Inline function library provided for use from C/C++
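The partitioned add in the figure can be emulated in portable C to show what the SIMD semantics are: four 16-bit lanes added independently, with no carries crossing lane boundaries. This is only an illustrative sketch; real VIS code would call the `vis_fpadd16()` inline from Sun's VIS function library, and the function name `fpadd16_emu` here is hypothetical.

```c
#include <stdint.h>

/* Portable emulation of a vis_fpadd16-style partitioned add:
 * treat a 64-bit word as four 16-bit lanes and add lane-wise,
 * wrapping within each lane (no inter-lane carries). */
uint64_t fpadd16_emu(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t ai = (uint16_t)(a >> (16 * lane));
        uint16_t bi = (uint16_t)(b >> (16 * lane));
        /* cast back to 16 bits so overflow wraps inside the lane */
        r |= (uint64_t)(uint16_t)(ai + bi) << (16 * lane);
    }
    return r;
}
```

On real VIS hardware all four lane additions happen in a single instruction; the loop here exists only to make the per-lane behavior explicit.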

  5. Memory Latency Hiding
  • A fast processor stalls when accessing slow memory
    • Cache memories can help to alleviate this problem
    • High-throughput streams of data amplify this problem
    • Software techniques can reduce the penalty
  • Technique: loop unrolling
    • Enlarges basic block size and reduces looping overhead
    • Can increase the time between data request and consumption
    • Low risk and no overhead; commonly used by compilers
  • Technique: software pipelining
    • Data load and usage are overlapped from different loop iterations
    • Increases register usage and lifetimes; hard for a compiler
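A minimal C sketch of manual loop unrolling on a weighted vector sum (a hypothetical kernel, not one from the paper): unrolling by 4 enlarges the basic block, cuts loop-test overhead, and the independent partial sums give the scheduler room to separate each load from its use, which is the first step toward software pipelining.

```c
/* Weighted vector sum with the inner loop manually unrolled by 4.
 * Four independent accumulators break the serial dependence chain,
 * so loads and multiplies from different lanes can overlap. */
float scaled_sum(const float *x, const float *a, int n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {   /* unrolled by 4 */
        s0 += a[i]     * x[i];
        s1 += a[i + 1] * x[i + 1];
        s2 += a[i + 2] * x[i + 2];
        s3 += a[i + 3] * x[i + 3];
    }
    for (; i < n; i++)                 /* remainder loop */
        s0 += a[i] * x[i];
    return s0 + s1 + s2 + s3;
}
```

The unroll factor is a tuning knob; the slides' horizontal-kernel results sweep exactly this kind of outer-loop unrolling factor from 1 to 7.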

  6. Software Data Prefetching
  • Non-blocking prefetch CPU instruction
    • Issued at some time prior to when the data is needed
    • Data at the effective address is brought into the cache
    • At a later load instruction, the data is already cached
  • Problems: overhead and “prefetch distance”
    • Uses extra cache and issues extra instructions
    • Prefetching too far ahead: excessive cache usage, spillage
    • Not far enough ahead: stall at the load instruction
  • Can be generated by a compiler
  • Implemented in the UltraSPARC-II CPU
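The same idea can be sketched with the GCC/Clang `__builtin_prefetch` intrinsic (the paper itself used hand-written assembly macros to emit UltraSPARC-II prefetch instructions). `PF_DIST` below is the prefetch distance, an assumed tuning value: too large wastes cache, too small leaves the load stalling anyway.

```c
#define PF_DIST 64   /* elements to prefetch ahead; a tuning assumption */

/* Stream over a large array, issuing a non-blocking prefetch
 * PF_DIST elements ahead of the current load so the data is
 * already cached when the load executes. */
double sum_with_prefetch(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&x[i + PF_DIST], /*rw=*/0, /*locality=*/1);
        s += x[i];
    }
    return s;
}
```

The prefetch is purely a hint: it never faults and never changes results, so the transformation is safe even when the distance is mistuned.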

  7. Sonar Beamforming
  • We evaluate two key kernels for 3-D beamforming
  • Typically the computational bottleneck in sonar
  • High-throughput streams of data
  • Goal: best performance using any means

  8. Time-Domain Beamforming
  • Delay-and-sum weighted sensor outputs
  • Geometrically project the sensor elements onto a line to compute the time delays

      b(t) = Σ_{i=1}^{M} α_i x_i(t − τ_i)

    where b(t) is the beam output, x_i(t) the ith sensor output, τ_i the ith sensor delay, and α_i the ith sensor weight
    [Figure: sensor elements and their projections onto a line for a beam pointing 20° off axis; x position in inches]
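The delay-and-sum equation maps directly to code. A minimal C sketch, assuming integer sample delays (the slides obtain finer delay resolution by interpolation, covered next) and zero output before the start of each sensor's record; the function name and layout (`x` as M sensors by `nsamp` samples, row-major) are assumptions for illustration.

```c
/* Delay-and-sum beamformer: b[t] = sum_i w[i] * x_i[t - delay[i]].
 * x      : m-by-nsamp sensor data, row-major
 * delay  : per-sensor delay in whole samples (>= 0)
 * w      : per-sensor weight
 * b      : nsamp output beam samples */
void delay_and_sum(const float *x, int m, int nsamp,
                   const int *delay, const float *w,
                   float *b)
{
    for (int t = 0; t < nsamp; t++) {
        float acc = 0.0f;
        for (int i = 0; i < m; i++) {
            int idx = t - delay[i];
            if (idx >= 0)                        /* treat pre-record samples as zero */
                acc += w[i] * x[i * nsamp + idx];
        }
        b[t] = acc;
    }
}
```

Per output sample this is M MACs plus M index computations, which is why the kernel is dominated by multiply-accumulate throughput and memory traffic.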

  9. Horizontal Beamformer
  • Sample at just above the Nyquist rate, then interpolate to obtain the desired time delay resolution
    [Figure: digital interpolation beamformer — stave data at sample interval Δ is interpolated up to interval δ = Δ/L, delayed by z^{-N_i}, weighted by α_i, and summed into a single beam output b[n]]
  • Modeled as a sparse FIR filter
    • Forming 61 beams from 80 elements with 2-point interpolation
    • 3000 index lookups plus 6000 floating-point MACs per sample
    • At each sample: 12 Kbytes of data, coefficient size of 36 Kbytes
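A minimal sketch of one beam sample with 2-point (linear) interpolation: each element's delay is split into a whole-sample part and a fractional part in [0, 1), and the interpolated sample is a 2-tap FIR, so each element costs one index lookup and two MACs, matching the sparse-FIR model above. All names here (`beam_sample`, `ndelay`, `frac`) are hypothetical; the actual kernel's data layout is not given in the slides.

```c
/* One beam output sample by delay-and-sum with linear interpolation.
 * x      : m-by-nsamp element data, row-major
 * ndelay : whole-sample part of each element's delay
 * frac   : fractional part of each delay, in [0, 1)
 * w      : per-element weight */
float beam_sample(const float *x, int m, int nsamp, int t,
                  const int *ndelay, const float *frac, const float *w)
{
    float acc = 0.0f;
    for (int i = 0; i < m; i++) {
        int idx = t - ndelay[i];
        if (idx >= 1)   /* need idx and idx-1 in range; else treat as zero */
            acc += w[i] * ((1.0f - frac[i]) * x[i * nsamp + idx]
                           + frac[i]        * x[i * nsamp + idx - 1]);
    }
    return acc;
}
```

Folding the weight into the two interpolation coefficients ahead of time would reduce this to exactly two MACs per element, which is how the 6000-MACs-per-sample figure (61 beams, ~80 elements, 2 taps) arises.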

  10. Vertical Beamformer
  [Figure: multiple vertical transducers at every horizontal position are combined into staves]
  • Vertical columns combined into 3 stave outputs
    • Multiple dot products (30 MACs per stave per sample)
    • Convert integer to floating point for the following stages
  • Ideal candidate for the Visual Instruction Set (VIS)
    • Use integer dot products (16x16-bit multiply, 32-bit add)
    • Highest-precision (and slowest) VIS mode
    • Coefficients must be scaled for best dynamic range
  
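A scalar C sketch of one stave output under these assumptions: a short integer dot product using 16x16-bit multiplies with 32-bit accumulation (the precision of the slowest VIS mode), followed by the integer-to-float conversion the later stages need. `scale`, which undoes the fixed-point coefficient scaling, is an assumed parameter; the VIS version would do four of these multiplies per instruction.

```c
#include <stdint.h>

/* One stave output sample: n-tap integer dot product
 * (16x16 -> 32-bit MACs), then convert to float and rescale. */
float stave_out(const int16_t *x, const int16_t *coef, int n, float scale)
{
    int32_t acc = 0;
    for (int k = 0; k < n; k++)
        acc += (int32_t)x[k] * (int32_t)coef[k];  /* integer MACs */
    return (float)acc * scale;                    /* int -> float for later stages */
}
```

With no extended-precision accumulator, the 32-bit `acc` bounds how large the scaled coefficients can be for a 30-tap sum, which is why the slides note the coefficients must be scaled for best dynamic range.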

  11. Tools Utilized
  • Sun’s SPARCompiler 5.0
    • Automated prefetch instruction generation?
    • Inline assembly macros for VIS instructions
    • Wrote assembly macros for the prefetch and fitos instructions
  • Shade: pficount (prefetch instruction counter)
  • INCAS (It’s a Nearly Cycle-Accurate Simulator)
  • perf-monitor: hardware performance counters
  • Benchmarks on a 336 MHz UltraSPARC-II

  12. Horizontal Kernel Performance
  [Chart: MFLOPS (150–450) vs. outer-loop unrolling factor (1–7) for multiple-pass, single-pass, and inline-PREFETCH variants; maximum 444 MFLOPS at 66% of peak, 1.32 FLOPC, 2.19 IPC]
  • Hand loop unrolling gives a speedup of 2.4
  • Multiple passes improve cache usage (93% / 97%)
  • Inline PREFETCH “breaks” compiler optimization

  13. Vertical Kernel Performance
  [Chart: MFLOPS (or MIOPS) for nine variants — floating point; floating point in asm; floating point with VIS loading; int (no VIS); VIS baseline; VIS with unrolled inner loop; VIS with double-loading; VIS rescheduled and pipelined; VIS with software prefetching. Best: 313 MIOPS at 93% of peak, 0.93 IOPC, 1.41 IPC]
  • VIS offers a 46% boost over floating point
  • Software prefetching gives an additional 34%
  • 104 MB/s data input, 62.7 MB/s data output

  14. Vertical Prefetch Statistics
  [Chart: execution time (sec, 0–3) split into no-stall, load-stall, and store-stall cycles for four trials — no prefetching, write prefetching only, read prefetching only, and read/write prefetching]
  • Breakdown of execution time
  • Execution cycles (no stall) are constant across trials
  • Internal cache statistics do not change

  15. Conclusion
  • Beamforming kernel results:
    • Horizontal beamformer kernel: 444 MFLOPS, 66% of peak
    • Vertical beamformer kernel: 313 MIOPS, 93% of peak
    • Loop unrolling: 2.4x speedup in the horizontal kernel
    • VIS: 1.46x speedup in the vertical kernel
    • Prefetching: 1.34x speedup in the vertical kernel
  • Near-peak performance can be achieved, but
    • Kernel optimization is difficult and time-consuming
    • The compiler did not generate prefetch instructions
  • For high-throughput real-time signal processing, general-purpose CPUs can be an attractive target
