Exploring Computation-Communication Tradeoffs in Camera Systems

SLIDE 1

Exploring Computation-Communication Tradeoffs in Camera Systems

Amrita Mazumdar, Armin Alaghi, Thierry Moreau, Luis Ceze, Sung Kim, Mark Oskin, Meghan Cowan, Visvesh Sathe

IISWC 2017

SLIDE 2

Camera applications are a prominent workload with tight constraints

Examples: video surveillance cameras, 3D-360 virtual reality camera rig, energy-harvesting camera, augmented reality glasses
Constraints: large data size, lightweight, real-time processing, low-power

SLIDE 3

Hardware implementations compound the camera system design space

constraints: power, time, size, bandwidth
implementations: ASIC, FPGA, DSP, CPU, GPU
example camera system: DogChat™

SLIDE 4

We can represent camera applications as camera processing pipelines to clarify design space exploration

sensor → block 1 → block 2 → block 3 → block 4
(the blocks are functions in the application)

SLIDE 5

We can represent camera applications as camera processing pipelines to clarify design space exploration

DogChat™: sensor → image processing → face detection → feature tracking → image rendering
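This pipeline abstraction is easy to make concrete in code. A minimal sketch (Python chosen for illustration; the block names come from the slide, but the per-frame output sizes are hypothetical placeholders):

```python
from dataclasses import dataclass

# A pipeline block: a named processing stage plus the size of the data it
# emits per frame. Sizes here are illustrative placeholders.
@dataclass
class Block:
    name: str
    out_bytes: int

# The DogChat pipeline from the slide, modeled as an ordered list of blocks.
dogchat = [
    Block("sensor",           12_000_000),  # raw frame
    Block("image processing",  6_000_000),
    Block("face detection",       50_000),  # detected regions
    Block("feature tracking",     10_000),  # feature coordinates
    Block("image rendering",   6_000_000),  # rendered output frame
]

def pipeline_names(blocks):
    """Readable summary of the pipeline order."""
    return " -> ".join(b.name for b in blocks)
```

Tracking each block's output size alongside its name is what later makes the compute-versus-communication comparison possible.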

SLIDE 6

Developers can trade off between computation and communication costs

DogChat™: sensor → image processing → face detection → feature tracking → image rendering
(offloaded to cloud)

SLIDE 7

Developers can trade off between computation and communication costs

DogChat™: sensor → image processing → face detection → feature tracking → image rendering
(split between in-camera processing and offloading to the cloud)
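One way to make this tradeoff concrete is to score every possible cut point: blocks before the cut run in-camera and pay compute energy, and the last in-camera block's output pays transfer energy. A sketch with illustrative costs (not the paper's measurements):

```python
# Sketch: scoring each possible cut point in a camera pipeline.
# Blocks before the cut run in-camera (pay compute energy); the output of
# the last in-camera block is sent over the link (pay transfer energy).
# All costs below are illustrative placeholders, not the paper's numbers.

PIPELINE = [
    # (name, compute_mJ_per_frame, output_KB_per_frame)
    ("sensor",           0.0, 12000),
    ("image processing", 2.0,  6000),
    ("face detection",   5.0,    50),
    ("feature tracking", 1.0,    10),
]
TRANSFER_MJ_PER_KB = 0.01  # assumed radio energy per kilobyte sent

def energy_at_cut(cut: int) -> float:
    """Energy per frame if blocks [0, cut) run in-camera and the rest is offloaded."""
    compute = sum(c for _, c, _ in PIPELINE[:cut])
    transfer = PIPELINE[cut - 1][2] * TRANSFER_MJ_PER_KB
    return compute + transfer

# The cheapest cut ships the smallest intermediate, not the raw frames.
best_cut = min(range(1, len(PIPELINE) + 1), key=energy_at_cut)
```

With these placeholder numbers the best cut falls after face detection: paying for a little more in-camera compute shrinks the transmitted data enough to win overall.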

SLIDE 8

Optional and required blocks in camera pipelines introduce more tradeoffs

required: sensor → image processing → face detection → feature tracking → image rendering
optional: edge detection, motion detection, motion tracking

SLIDE 9

Custom hardware platforms explode the camera system design space

required: sensor → image processing → face detection → feature tracking → image rendering
optional: edge detection, motion detection, motion tracking
platforms: ASIC, FPGA, DSP, CPU, GPU
SLIDE 10

Custom hardware platforms explode the camera system design space

(same pipeline and platforms as the previous slide)

In-camera processing pipelines can help us evaluate these tradeoffs!

SLIDE 11

Challenges for modern camera systems
Low-power: face authentication for energy-harvesting cameras with ASIC design (motion detection → face detection → neural network)
Low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration (prep → align → depth → stitch)


SLIDE 13

Face authentication with energy-harvesting cameras

WISPCam: an energy-harvesting camera powered by RF, capturing 1 frame/second with ~1 mW of processing per frame

SLIDE 14

Face authentication with energy-harvesting cameras

Is this Armin? ✅

SLIDE 15

CPU-based face authentication neural networks can exceed WISPCam power budgets

sensor → neural network (on-chip CPU) → other application functions (cloud)
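A back-of-the-envelope budget check shows why: ~1 mW of harvested power at 1 frame/second leaves roughly 1 mJ per frame. The per-block energies below are hypothetical, not measurements from the paper; the point is only the shape of the comparison:

```python
# Sketch: a back-of-the-envelope budget check for a WISPCam-class device.
# ~1 mW of harvested power at 1 frame/second gives roughly 1 mJ per frame.
# The inference energies below are hypothetical placeholders; the point is
# that a CPU-based NN inference can exceed the harvested budget while an
# ASIC implementation can fit.

BUDGET_MJ_PER_FRAME = 1.0   # ~1 mW x 1 s

cpu_nn_mj = 5.0     # hypothetical CPU inference energy per frame
asic_nn_mj = 0.2    # hypothetical ASIC inference energy per frame

fits_cpu = cpu_nn_mj <= BUDGET_MJ_PER_FRAME
fits_asic = asic_nn_mj <= BUDGET_MJ_PER_FRAME
```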

SLIDE 16

Adding optional blocks can reduce power consumption for a neural network

sensor → motion detection → face detection → neural network (on-chip ASIC circuit) → other application functions (cloud)

SLIDE 17

Exploring design tradeoffs in ASIC accelerators

Neural network accelerator: evaluated NN topology and hardware impact on energy and accuracy; selected a 400-8-1 network topology with 8-bit datapaths for the optimal energy/accuracy point.
[Diagram: SNNAP accelerator with DMA master, bus scheduler, and a processing unit containing SRAM control, four PEs (8-bit multipliers and adders accumulating to 26 bits), accumulator and sigmoid FIFOs, and a sigmoid unit]

Streaming face detection accelerator: explored classifier and other algorithm parameters to optimize energy efficiency.
[Diagram: Viola-Jones detector with integral image accumulator, window buffer, feature units, stage units, and a threshold unit]

Many more details in the paper!
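The 400-8-1 topology with 8-bit datapaths is small enough to sketch end to end. A minimal illustration of the quantized forward pass (weights are random placeholders, not trained values, and the fixed-point rescale factor is an assumption for illustration):

```python
import numpy as np

# Sketch of the 400-8-1 face-authentication MLP with 8-bit weights,
# mirroring the accelerator's quantized datapath. Weights are random
# placeholders (not trained values) and the fixed-point scale (2**12)
# is an assumption for illustration.
rng = np.random.default_rng(0)
W1 = rng.integers(-128, 128, size=(8, 400), dtype=np.int8)  # input -> hidden
W2 = rng.integers(-128, 128, size=(1, 8), dtype=np.int8)    # hidden -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def authenticate(pixels):
    """Forward pass: 400 8-bit pixels -> 8 hidden units -> 1 score."""
    x = pixels.astype(np.int32)                      # widen before the MACs
    h = sigmoid(W1.astype(np.int32) @ x / 2.0**12)   # rescale from fixed point
    y = sigmoid(W2.astype(np.float64) @ h)
    return float(y[0])

frame = rng.integers(0, 256, size=400, dtype=np.uint8)
score = authenticate(frame)    # authentication score in [0, 1]
```

Keeping weights and pixels at 8 bits shrinks both the multipliers and the weight SRAM, which is where most of the datapath energy goes; only the accumulators need to be wide.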

SLIDE 18

Evaluation

Synthesized ASIC accelerators in Synopsys
Constructed a simulator to evaluate power consumption on real-world video input
Computed power for computation and transfer of resulting data for each pipeline configuration

Which pipeline achieves the lowest overall power?
SLIDE 19

Which pipeline achieves the lowest power consumption?

pipeline configuration                 compute   transfer
sensor                                 <1%       >99%
sensor → motion                        <1%       >99%
sensor → face detect                   10%       90%
sensor → NN                            16%       84%
sensor → motion → face detect          >99%      <1%
sensor → motion → NN                   >99%      <1%
sensor → face detect → NN              >99%      <1%
sensor → motion → face detect → NN     >99%      <1%

[Chart: total power per configuration on a log scale from 1 to 10^6 µW; bar labels, in an order not recoverable from the transcript: 160; 419; 257,236; 132; 782,090; 374; 3,731; 11,340]
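The shape of the chart, with cheap prefilters gating expensive blocks, can be reproduced with a duty-cycling model. All power numbers and the motion hit rate below are illustrative placeholders, not the paper's measurements:

```python
# Sketch: why a cheap prefilter lowers average power. An always-on
# motion detector gates the expensive NN and the radio, so they only
# pay for frames that actually contain motion. All power numbers and
# the motion hit rate are illustrative, not the paper's measurements.

MOTION_UW = 50          # always-on motion-detection prefilter (µW)
NN_UW = 3000            # NN power while active (µW)
RADIO_RAW_UW = 200000   # power to stream raw frames continuously (µW)
MATCH_FRACTION = 0.001  # a match result is tiny compared to a raw frame
HIT_RATE = 0.05         # fraction of frames containing motion

no_prefilter = RADIO_RAW_UW  # stream everything to the cloud
with_prefilter = MOTION_UW + HIT_RATE * (NN_UW + RADIO_RAW_UW * MATCH_FRACTION)
```

The average cost of the expensive blocks scales with the hit rate, so a low-power always-on filter pays for itself whenever interesting frames are rare.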

SLIDE 20

Which pipeline achieves the lowest power consumption? (chart repeated from Slide 19)

Callout: prefilters reduce overall power.
SLIDE 21

Which pipeline achieves the lowest power consumption? (chart repeated from Slide 19)

Callout: prefilters combined with the NN use less power than the NN alone.

SLIDE 22

Which pipeline achieves the lowest power consumption? (chart repeated from Slide 19)

Callouts: most power-efficient configuration; most power-efficient configuration with the on-chip NN.
SLIDE 23

In-camera processing for face authentication

In isolation, even well-designed hardware can show sub-optimal performance.
Optional blocks can improve the overall cost if they balance compute and communication better than the original design.

(motion detection → face detection → neural network)

SLIDE 24

Challenges for modern camera systems (recap)
Low-power: face authentication for energy-harvesting cameras with ASIC design (motion detection → face detection → neural network)
Low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration (prep → align → depth → stitch)


SLIDE 26

Producing real-time VR video from a camera rig

16 GoPro cameras at 4K-30 fps produce 3.6 GB/s of raw video.
Goal: 30 fps 3D-360 stereo video (1.8 GB/s output).

SLIDE 27

Producing real-time VR video from a camera rig (setup repeated from Slide 26)

Callout: cloud processing prevents real-time video.

SLIDE 28

VR pipeline is usually offloaded to perform heavy computation

sensor → prep (5%) → image align (20%) → depth from flow (70%) → image stitch (5%) → stream to viewer (offloaded to cloud)
(percentages are each stage's share of processing time)

We need to accelerate "depth from flow" to achieve high performance.

SLIDE 29

Offloading before the costly step doesn't avoid compute-communication tradeoffs

sensor → prep → image align → depth from flow → image stitch → stream to viewer

[Chart: video frame size (MB), axis from 150 to 600] The image alignment step produces significant intermediate data: offloading early on still ships 2x the final output size.
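The consequence for frame rate follows directly from the link arithmetic. A sketch using the 2x relationship from the slide with an illustrative final frame size (the exact sizes are read off the slide's MB axis, not exact measurements):

```python
# Sketch: offloading right after image alignment still pays heavily for
# the link, because the aligned intermediate is about 2x the size of the
# final stitched output (per the slide). The final frame size below is
# an illustrative placeholder, not an exact measurement.

LINK_MBPS = 2000.0          # assumed 2 GB/s network link
FINAL_MB = 300.0            # illustrative final stitched frame size
ALIGNED_MB = 2 * FINAL_MB   # intermediate after alignment, ~2x final

fps_offload_after_align = LINK_MBPS / ALIGNED_MB   # link-bound frame rate
```

Even with a generous link, shipping the 2x intermediate caps the pipeline far below the 30 fps target, which is why the costly depth step has to move in-camera.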

SLIDE 30

Evaluation

Designed a simple parallel accelerator for a Xilinx Zynq SoC, simulated for Virtex UltraScale+
Evaluated against CPU and GPU implementations in Halide
Assumed a 2 GB/s network link for communication

Which pipeline achieves the highest frame rate? (implementation details in the paper)

SLIDE 31

Which pipeline achieves the highest frame rate?

pipeline configuration                          compute FPS   transfer FPS   effective FPS
sensor                                          100           15.8           15.8
sensor → prep                                   100           15.8           15.8
sensor → prep → align                           100           3.95           3.95
sensor → prep → align → depth (CPU)             0.09          5.27           0.09
sensor → prep → align → depth (GPU)             11.2          5.27           5.27
sensor → prep → align → depth (FPGA)            174           5.27           5.27
sensor → prep → align → depth (CPU) → stitch    0.09          31.6           0.09
sensor → prep → align → depth (GPU) → stitch    11.2          31.6           11.2
sensor → prep → align → depth (FPGA) → stitch   174           31.6           31.6

(effective FPS is the minimum of the compute and transfer rates)
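The bottleneck structure of these numbers is simple to model: frames leave the camera no faster than the slower of "process a frame" and "ship a frame" over the link. A sketch, with frame sizes back-calculated from the slide's transfer rates over the assumed 2 GB/s link:

```python
# Sketch: effective throughput of a split pipeline is the bottleneck of
# its compute rate and its transfer rate. The stitched frame size below
# is back-calculated from the slide's 31.6 FPS transfer rate over the
# assumed 2 GB/s link, so it is an approximation.

LINK_MBPS = 2000.0   # assumed 2 GB/s network link

def transfer_fps(frame_mb: float) -> float:
    """How many frames per second the link can carry."""
    return LINK_MBPS / frame_mb

def effective_fps(compute_fps: float, frame_mb: float) -> float:
    """Throughput is limited by the slower of compute and transfer."""
    return min(compute_fps, transfer_fps(frame_mb))

STITCHED_MB = 63.3   # ~2000 / 31.6: final stereo frame after stitching

fpga_full = effective_fps(174.0, STITCHED_MB)  # FPGA depth + in-camera stitch
cpu_full = effective_fps(0.09, STITCHED_MB)    # CPU depth: compute-bound
```

With the full in-camera pipeline, only the FPGA's 174 FPS compute rate leaves the link as the (acceptable) bottleneck; the CPU version is compute-bound three orders of magnitude below real time.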

SLIDE 32

Which pipeline achieves the highest frame rate? (table repeated from Slide 31)

Callout: CPU results are the slowest.

SLIDE 33

Which pipeline achieves the highest frame rate? (table repeated from Slide 31)

Callout: the data size is too big after the depth step for offloading.
SLIDE 34

Which pipeline achieves the highest frame rate? (table repeated from Slide 31)

Callout: the full pipeline with the FPGA is the only configuration that achieves a real-time frame rate.

SLIDE 35

In-camera processing for real-time VR video

Computation and communication together highlight benefits not seen when either is considered separately.
For VR video, in-camera processing pipelines enable applications that cloud offload cannot achieve at all.

(prep → align → depth → stitch)

SLIDE 36

In-camera processing pipelines help characterize camera systems

In-camera pipelines evaluate computation-communication tradeoffs.
Use hardware-software co-design to balance constraints and optimize designs.
Achieve optimal performance by considering bottlenecks in the context of the full system.

Thank you!