Exploring Computation-Communication Tradeoffs in Camera Systems
Amrita Mazumdar Thierry Moreau Sung Kim Meghan Cowan Armin Alaghi Luis Ceze Mark Oskin Visvesh Sathe
IISWC 2017
Camera applications are a prominent workload with tight
Example camera applications: video surveillance cameras, 3D-360 virtual reality camera rig, energy-harvesting camera, augmented reality glasses
Demands across these applications: large data size, light weight, real-time processing, low power
Constraint: power, time, size, bandwidth
Implementation: ASIC, FPGA, DSP, CPU, GPU
Example camera system: DogChat™
sensor → block 1 → block 2 → block 3 → block 4
The blocks are functions in the application:
sensor → image processing → face detection → feature tracking → image rendering
In-camera processing
Required functions: sensor, image processing, face detection, feature tracking, image rendering
Additional functions: edge detection, motion detection, motion tracking
Implementation choices: ASIC, FPGA, DSP, CPU, GPU
In-camera processing pipelines can help us evaluate these tradeoffs!
Two in-camera pipelines:
motion detection → face detection → neural network
prep → align → depth → stitch
WISP Cam: energy-harvesting camera powered by RF; 1 frame / second; ~1 mW processing / frame
Is this Armin? ✅
sensor → neural network → cloud
application functions, ASIC hardware
face detection motion detection
circuit
Evaluated NN topology and hardware impact on energy and accuracy.
Selected a 400-8-1 network topology and used 8-bit datapaths for the best energy/accuracy point.
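To make the 400-8-1 topology and 8-bit datapath concrete, here is a minimal sketch of such a forward pass in plain Python. The quantization scheme (`quantize8` with a fixed scale), the random weights, and all function names are assumptions for illustration; the slides do not describe the actual accelerator arithmetic beyond the topology, the datapath width, and the presence of a sigmoid unit.

```python
import math
import random

def quantize8(x, scale):
    """Map x to a signed 8-bit integer code (hypothetical fixed-scale scheme)."""
    q = round(x / scale)
    return max(-128, min(127, q))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_400_8_1(pixels, w_hidden, w_out, in_scale=1 / 127, w_scale=1 / 127):
    """Forward pass of a 400-8-1 perceptron using 8-bit inputs and weights.

    pixels:   400 floats in [0, 1]
    w_hidden: 8 rows of 400 floats in [-1, 1]
    w_out:    8 floats in [-1, 1]
    """
    assert len(pixels) == 400 and len(w_hidden) == 8 and len(w_out) == 8
    x_q = [quantize8(p, in_scale) for p in pixels]
    hidden = []
    for row in w_hidden:
        # Integer multiply-accumulate, then rescale before the sigmoid.
        acc = sum(quantize8(w, w_scale) * x for w, x in zip(row, x_q))
        hidden.append(sigmoid(acc * in_scale * w_scale))
    h_q = [quantize8(h, in_scale) for h in hidden]
    acc = sum(quantize8(w, w_scale) * h for w, h in zip(w_out, h_q))
    return sigmoid(acc * in_scale * w_scale)

# Random placeholder data; a real network would use trained weights.
rng = random.Random(0)
pixels = [rng.random() for _ in range(400)]
w_hidden = [[rng.uniform(-1, 1) for _ in range(400)] for _ in range(8)]
w_out = [rng.uniform(-1, 1) for _ in range(8)]
score = forward_400_8_1(pixels, w_hidden, w_out)
```

A 400-8-1 network has only 400 x 8 + 8 = 3,208 weights, which is part of why a small topology with narrow datapaths is attractive at this power budget.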
[Figure: SNNAP neural network accelerator: DMA master, bus scheduler, and processing units (PU) with control, SRAM, processing elements PE0 to PE3 (multiply-add with weights), accumulator FIFO, sigmoid FIFO, and sigmoid unit]
[Figure: Viola-Jones (VJ) face detection accelerator: integral image accumulator fed by input pixel rows, window buffer, and classifier unit with feature, stage, and threshold units]
Streaming face detection accelerator. Explored classifier and other algorithm parameters to optimize energy.
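The integral image at the front of a Viola-Jones detector can be sketched in a few lines. This is the standard summed-area table, built row by row in the spirit of a streaming accumulator; the function names and the tiny test image are illustrative assumptions, not the accelerator's actual design.

```python
def integral_image(img):
    """Summed-area table, built one row at a time as pixels stream in."""
    height, width = len(img), len(img[0])
    # One extra zero row and column simplifies the lookups below.
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0  # running sum of the current input row
        for x in range(width):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def box_sum(ii, top, left, bottom, right):
    """Sum of pixels in rows top..bottom-1 and columns left..right-1."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
whole = box_sum(ii, 0, 0, 3, 3)        # all nine pixels
lower_right = box_sum(ii, 1, 1, 3, 3)  # 5 + 6 + 8 + 9
```

Each rectangle sum costs four lookups regardless of rectangle size, which is what makes the rectangle features of the classifier cheap to evaluate over a sliding window.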
Synthesized ASIC accelerators in Synopsys.
Constructed simulator to evaluate power consumption on real-world video input.
Computed power for computation and transfer of resulting data for each pipeline configuration.
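The per-configuration accounting described here can be sketched as compute power plus the power needed to transmit whatever data leaves the camera. The function name, the radio energy per byte, and every number in the example are hypothetical placeholders, not values from the study.

```python
def pipeline_power_uw(compute_uw, output_bytes_per_s, radio_nj_per_byte):
    """Return (compute, transfer, total) power in microwatts.

    bytes/s * nJ/byte gives nJ/s, i.e. nW; dividing by 1000 converts to µW.
    """
    transfer_uw = output_bytes_per_s * radio_nj_per_byte / 1000.0
    return compute_uw, transfer_uw, compute_uw + transfer_uw

# Hypothetical example: a filter that costs 50 µW but cuts output data 100x.
raw = pipeline_power_uw(compute_uw=0.0, output_bytes_per_s=100_000, radio_nj_per_byte=10)
filtered = pipeline_power_uw(compute_uw=50.0, output_bytes_per_s=1_000, radio_nj_per_byte=10)

for name, (compute, transfer, total) in [("raw", raw), ("filtered", filtered)]:
    share = 100.0 * transfer / total
    print(f"{name}: total {total:.0f} µW, transfer share {share:.0f}%")
```

With these made-up numbers the filtered configuration spends more on computation but far less in total, which is the kind of shift between the compute and transfer shares that the evaluation looks for.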
Power per platform configuration (compute share / transfer share):
sensor: <1% / >99%
sensor → motion: <1% / >99%
sensor → face detect: 10% / 90%
sensor → NN: 16% / 84%
sensor → motion → face detect: >99% / <1%
sensor → motion → NN: >99% / <1%
sensor → face detect → NN: >99% / <1%
sensor → motion → face detect → NN: >99% / <1%
[Figure: power in µW on a log axis from 1 to 1,000,000; bar labels: 160; 419; 257,236; 132; 782,090; 374; 3,731; 11,340 (ratios)]
prefilters reduce
Compared with just using the NN, prefilters with the NN use less power.
most power-efficient; most power-efficient with
16 GoPro cameras, 4K at 30 fps: 3.6 GB/s raw video
Goal: 30 fps 3D-360 stereo video, 1.8 GB/s output
Cloud processing prevents real-time video
sensor → prep → image align → depth from flow → image stitch → stream to viewer
Processing time: prep 5%, image align 20%, depth from flow 70%, image stitch 5%
Need to accelerate "depth from flow" to achieve high performance
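The processing-time split explains why depth from flow is the stage to accelerate. Below is a small sketch using Amdahl's law, assuming the four stages run one after another and that the percentages above are shares of total time; the 10x speedup is a hypothetical value, not a result from the talk.

```python
# Shares of processing time per stage, taken from the breakdown above.
fractions = {"prep": 0.05, "align": 0.20, "depth": 0.70, "stitch": 0.05}

def overall_speedup(fractions, speedups):
    """Amdahl's law: whole-pipeline speedup when some stages are accelerated.

    speedups maps stage name -> speedup factor; unlisted stages stay at 1x.
    """
    old_time = sum(fractions.values())
    new_time = sum(f / speedups.get(stage, 1.0) for stage, f in fractions.items())
    return old_time / new_time

depth_10x = overall_speedup(fractions, {"depth": 10.0})            # hypothetical 10x
depth_limit = overall_speedup(fractions, {"depth": float("inf")})  # upper bound
align_limit = overall_speedup(fractions, {"align": float("inf")})

print(f"depth 10x faster:   {depth_10x:.2f}x overall")
print(f"depth infinitely:   {depth_limit:.2f}x overall")
print(f"align infinitely:   {align_limit:.2f}x overall")
```

Even an infinitely fast alignment stage would give at most 1.25x overall, while accelerating depth from flow can approach about 3.3x, so the remaining stages bound what any single accelerator can deliver.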
[Figure: video frame size (MB) after each pipeline step; axis from 150 to 600 MB]
Image alignment step produces significant intermediate data
still 2x final output size
Designed a simple parallel accelerator for Xilinx Zynq SoC, simulated for Virtex UltraScale+
Evaluated against CPU and GPU implementations in Halide
Assumed 2 GB/s network link for communication
Implementation details in paper
pipeline configuration                                  compute (FPS)   transfer (FPS)   effective FPS
sensor                                                  100             15.8             15.8
sensor → prep                                           100             15.8             15.8
sensor → prep → align                                   100             3.95             4.0
sensor → prep → align → depth (CPU)                     0.09            5.27             0.09
sensor → prep → align → depth (GPU)                     11.2            5.27             5.3
sensor → prep → align → depth (FPGA)                    174             5.27             5.3
sensor → prep → align → depth (CPU) → stitch            0.09            31.6             0.09
sensor → prep → align → depth (GPU) → stitch            11.2            31.6             11.2
sensor → prep → align → depth (FPGA) → stitch           174             31.6             31.6
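The effective FPS values on the chart are consistent with taking the slower of the compute rate and the transfer rate for each configuration. The sketch below reproduces that reading from the compute and transfer figures listed above; the min rule and the 30 fps real-time threshold are interpretations of the slide, and the names are illustrative.

```python
# (configuration, compute FPS, transfer FPS), copied from the slide.
CONFIGS = [
    ("sensor", 100, 15.8),
    ("sensor > prep", 100, 15.8),
    ("sensor > prep > align", 100, 3.95),
    ("sensor > prep > align > depth (CPU)", 0.09, 5.27),
    ("sensor > prep > align > depth (GPU)", 11.2, 5.27),
    ("sensor > prep > align > depth (FPGA)", 174, 5.27),
    ("sensor > prep > align > depth (CPU) > stitch", 0.09, 31.6),
    ("sensor > prep > align > depth (GPU) > stitch", 11.2, 31.6),
    ("sensor > prep > align > depth (FPGA) > stitch", 174, 31.6),
]

def effective_fps(compute_fps, transfer_fps):
    """The slower of computation and communication limits the frame rate."""
    return min(compute_fps, transfer_fps)

def real_time_configs(configs, target_fps=30.0):
    """Names of configurations whose effective rate meets the target."""
    return [name for name, c, t in configs if effective_fps(c, t) >= target_fps]

for name, c, t in CONFIGS:
    limit = "compute" if c < t else "transfer"
    print(f"{name}: {effective_fps(c, t)} FPS ({limit}-bound)")

winners = real_time_configs(CONFIGS)
```

Under this reading, the GPU and FPGA configurations that stop after depth are both transfer-bound at 5.27 FPS, and only the full FPGA pipeline clears 30 FPS.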
CPU results are slowest
Data size is too big after depth for
The full pipeline with the FPGA is the only one that achieves a real-time frame rate.
Computation and communication together highlight benefits not seen when considered separately.
For VR video, in-camera processing pipelines enable applications that could not even be achieved via cloud offload.
In-camera pipelines evaluate computation-communication tradeoffs
Use hardware-software co-design to balance constraints and
Achieve optimal performance by considering bottlenecks in context of full system