A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality Video

SLIDE 1

A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality Video

Amrita Mazumdar, Armin Alaghi (University of Washington)
Jonathan T. Barron, David Gallup (Google)
Luis Ceze, Mark Oskin, Steven M. Seitz (University of Washington)

SLIDE 2

virtual reality video with omnidirectional stereo (ODS)

SLIDE 3

the Google Jump camera rig can capture ODS video easily

16 GoPros x 4K camera feed
3.6 GB/s raw video

SLIDE 4

the Google Jump camera rig can capture ODS video easily

Anderson et al., SIGGRAPH Asia 2016

SLIDE 5

the Google Jump camera rig can capture ODS video easily

Anderson et al., SIGGRAPH Asia 2016

SLIDE 6

processing video from Google Jump is slow

1 hour of video takes 10 hours to process on 1000 cores

Anderson et al., SIGGRAPH Asia 2016
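
The processing cost implied by these figures is easy to make explicit. A back-of-the-envelope sketch in Python, using only the numbers quoted on the slide; the variable names are illustrative, and the last quantity assumes perfectly linear scaling, which the source does not claim:

```python
# Figures quoted on the slide (Anderson et al., SIGGRAPH Asia 2016).
video_hours = 1.0        # footage to process
wall_clock_hours = 10.0  # processing time
cores = 1000             # cores used

# How far the pipeline is from real time, and the total compute spent.
slowdown = wall_clock_hours / video_hours   # factor slower than real time
core_hours = wall_clock_hours * cores       # core-hours per hour of video

# Hypothetical: cores needed for real time under ideal linear scaling.
cores_for_real_time = cores * slowdown

print(slowdown, core_hours, cores_for_real_time)
```

Even with 1000 cores the pipeline runs ten times slower than real time, which is the gap the rest of the talk addresses.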

SLIDE 7

Google Jump pipeline breakdown

sensor → download → pre-processing → alignment → optical flow → compositing → to viewer

Anderson et al., SIGGRAPH Asia 2016

SLIDE 8

Google Jump pipeline breakdown

sensor → download → pre-processing → alignment → optical flow → compositing → to viewer

12% 69% 17% 2%

the bilateral solver dominates processing time

Anderson et al., SIGGRAPH Asia 2016

SLIDE 9

The bilateral solver produces an image that is smooth and accurate.

input pair (from two cameras)
→ blocky flow field
→ upsample into noisy flow field
→ transform to bilateral grid and solve
→ output result: smooth flow field

Anderson et al., SIGGRAPH Asia 2016
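
The "transform to bilateral grid" step and the final read-back can be illustrated on a single scanline. The sketch below is a simplification, not the Jump implementation: it uses nearest-neighbor binning, skips the solve step entirely (it only averages within each grid cell), and the names `splat`, `slice_grid`, `sigma_x`, and `sigma_r` are hypothetical.

```python
def splat(guide, values, sigma_x, sigma_r):
    """Accumulate (value sum, sample count) per grid vertex (x bin, intensity bin)."""
    grid = {}
    for x, (g, v) in enumerate(zip(guide, values)):
        key = (x // sigma_x, g // sigma_r)
        total, weight = grid.get(key, (0.0, 0.0))
        grid[key] = (total + v, weight + 1.0)
    return grid


def slice_grid(grid, guide, sigma_x, sigma_r):
    """Read each pixel's result back from the vertex it was splatted into."""
    out = []
    for x, g in enumerate(guide):
        total, weight = grid[(x // sigma_x, g // sigma_r)]
        out.append(total / weight)
    return out


# One scanline with a strong edge between pixel 2 and pixel 3.
guide = [10, 12, 11, 200, 205, 198]    # guide-image intensities
noisy = [1.0, 3.0, 2.0, 9.0, 7.0, 8.0]  # noisy per-pixel values (e.g. flow)

grid = splat(guide, noisy, sigma_x=3, sigma_r=64)
smooth = slice_grid(grid, guide, sigma_x=3, sigma_r=64)
print(smooth)  # [2.0, 2.0, 2.0, 8.0, 8.0, 8.0]
```

Pixels on either side of the edge land in different intensity bins, so the averaging never mixes values across the edge.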

SLIDE 10

this work: a hardware-friendly bilateral solver (HFBS)

SLIDE 11

The bilateral solver is hard to parallelize

second-order global optimization: global communication prevents aggressive parallelization
high-dimensional, sparse matrices: sparsity results in significant divergence on GPUs
why not a dense grid? too large to store on-chip

SLIDE 12

HFBS is easier to parallelize

Barron & Poole 2016 | HFBS (our work)
✅ includes color | grayscale only
dense matrix too big to fit in memory | ✅ dense matrix fits in memory
global communication required | ✅ local communication only
iterative bistochastization before solving | ✅ partial, non-iterative bistochastization

detailed formulation in paper
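
The slide defers the formulation to the paper, so the following is only a generic illustration of what "local communication only" can look like on a dense grid: a Jacobi-style sweep in which every vertex reads nothing but its direct neighbors. The objective, the names `jacobi_step` and `solve`, and the weight `lam` are assumptions made for this sketch, not the HFBS equations. It also assumes every vertex has either a neighbor or nonzero confidence.

```python
def jacobi_step(x, target, conf, lam):
    """One sweep: every vertex is recomputed from its direct neighbors only."""
    h, w = len(x), len(x[0])
    new = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            nbrs = [x[a][b]
                    for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= a < h and 0 <= b < w]
            # Minimizer of conf*(v - target)^2 + lam*sum((v - n)^2 for n in nbrs).
            new[i][j] = (conf[i][j] * target[i][j] + lam * sum(nbrs)) / (
                conf[i][j] + lam * len(nbrs))
    return new


def solve(target, conf, lam=1.0, iterations=256):
    """Run a fixed number of sweeps, starting from the noisy target."""
    x = [row[:] for row in target]
    for _ in range(iterations):
        x = jacobi_step(x, target, conf, lam)
    return x


# A 1x3 grid whose middle vertex has no observation (confidence 0).
result = solve([[1.0, 0.0, 1.0]], [[1.0, 0.0, 1.0]])
```

Because each sweep reads only the previous iterate, all vertices can be updated at the same time, which is the property a parallel hardware design can exploit.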

SLIDE 13

HFBS demonstrates imperceptible accuracy loss

task: Ferstl et al., ICCV 2013
data: Middlebury stereo dataset

[Image panels: input image, noisy depth map, Barron & Poole 2016, HFBS (this work)]

SLIDE 14

algorithm optimizations make it easier to implement the bilateral solver in parallel hardware

SLIDE 15

algorithm optimizations make it easier to implement the bilateral solver in parallel hardware

plan: exploit this parallelism with a custom hardware accelerator

SLIDE 16

Mapping HFBS to hardware

sensor → download → pre-processing → alignment → optical flow → compositing → to viewer

SLIDE 17

Mapping HFBS to hardware

sensor → download → pre-processing → alignment → optical flow → compositing → to viewer

load video pair
→ construct bilateral grid per pair
→ perform hardware-friendly bilateral solver
→ slice out solution into output images

platforms: CPU, FPGA

SLIDE 18

microarchitecture

[Block diagram components: CPU, main memory, AXI memory interface, HFBS controller, z-axis memory controller, z-axis memory banks (three shown), bilateral filter workers (three shown), memory access selector]

fixed-point datapath
custom memory layout

SLIDE 19

Floating-point resource requirements limit hardware parallelism

 | float64 | 32-bit fixed | 64-bit fixed | 47-bit fixed
DSPs per worker | 18 | 1 | 16 | 4
Maximum # workers | 379 | 6840 | 427 | 1710
Error (MSE) | - | 8.3 x 10^-4 | 7.16 x 10^-13 | 6.69 x 10^-7
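
The worker counts in this table are roughly what a fixed DSP budget divided by the per-worker cost would give. The sketch below assumes a budget of 6840 DSPs, inferred from the 32-bit fixed column (1 DSP per worker, 6840 workers); the slide does not state the budget, and `max_workers` is a hypothetical helper.

```python
# Assumption: total DSP budget inferred from the 32-bit fixed column.
DSP_BUDGET = 6840

# DSPs per worker, as listed on the slide.
dsps_per_worker = {
    "float64": 18,
    "32-bit fixed": 1,
    "64-bit fixed": 16,
    "47-bit fixed": 4,
}


def max_workers(dsps, budget=DSP_BUDGET):
    """Upper bound on workers if DSPs were the only limiting resource."""
    return budget // dsps


estimates = {name: max_workers(d) for name, d in dsps_per_worker.items()}
print(estimates)
```

This simple division reproduces the three fixed-point columns exactly (6840, 427, 1710) and gives 380 for float64, one more than the 379 on the slide, so the real limit evidently depends on something this model leaves out.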

SLIDE 20

Fixed-point datapath conversion

[Plot: max error (MSE relative to float64, log scale, 1E-12 to 1E-02) versus decimal precision (fraction of bitwidth, 40% to 90%); one series per bitwidth: 32, 47, 64]
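
The kind of measurement behind this plot can be sketched in a few lines: quantize values to a signed fixed-point format and compare against the floating-point originals. This is an illustration under assumptions, not the authors' evaluation code; `to_fixed` and `mse` are hypothetical helpers, and saturation on overflow is an assumed policy.

```python
def to_fixed(value, total_bits, frac_bits):
    """Quantize to signed fixed point (saturating), then convert back to float."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    # round() picks the nearest integer code (ties go to the even code).
    code = max(lo, min(hi, round(value * scale)))
    return code / scale


def mse(reference, total_bits, frac_fraction):
    """Mean squared error of the quantized values relative to the float inputs."""
    frac_bits = int(total_bits * frac_fraction)
    errors = [(v - to_fixed(v, total_bits, frac_bits)) ** 2 for v in reference]
    return sum(errors) / len(errors)


samples = [0.1, 0.2, 0.3]
coarse = mse(samples, 32, 0.25)  # 8 fraction bits
fine = mse(samples, 32, 0.50)    # 16 fraction bits
print(coarse, fine)
```

Giving more of the word to the fraction lowers rounding error for small values, but leaves fewer integer bits, so large values saturate sooner; the best split depends on the data range.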

SLIDE 21

z-axis slicing for bilateral grid memory layout

z = 0:
x:0,y:0,r:255,g:172,b:0
x:0,y:1,r:255,g:172,b:0
. . .
x:100,y:100,r:255,g:172,b:0
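
One way to picture a z-sliced layout is as an address mapping in which every z slice is stored contiguously and slices are spread over separate memory banks. The sketch below is hypothetical: the round-robin bank assignment and the function name `z_slice_address` are assumptions, and only the ordering within a slice (y varying fastest) follows the listing on the slide.

```python
def z_slice_address(x, y, z, width, height, num_banks):
    """Map a grid vertex to (bank, offset) with one contiguous block per z slice."""
    bank = z % num_banks            # assumed: slices assigned round-robin to banks
    slice_in_bank = z // num_banks  # which of that bank's slices this is
    # Within a slice, y varies fastest, matching the order shown on the slide.
    offset = (slice_in_bank * width + x) * height + y
    return bank, offset


# The slide's example slice covers x and y from 0 to 100, i.e. 101 x 101 vertices.
first = z_slice_address(0, 0, 0, width=101, height=101, num_banks=3)
last = z_slice_address(100, 100, 0, width=101, height=101, num_banks=3)
print(first, last)  # (0, 0) (0, 10200)
```

Under this assumed mapping, vertices that differ only in z fall into different banks whenever their z values differ modulo the bank count, so several workers could read neighboring slices without contending for the same bank.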

SLIDE 22

Evaluation

SLIDE 23

Evaluation

sensor → download → pre-processing → alignment → optical flow → compositing → to viewer

12% 69% 17% 2%

Does HFBS improve runtime?
How does parallelization affect power?

SLIDE 24

Experimental Setup

CPU: Intel Xeon E5-2620
GPU: NVIDIA GTX 1080 Ti
FPGA: Xilinx Virtex UltraScale+
Baseline: Barron & Poole 2016 (CPU only)
256 iterations of optimization
Varied bilateral grid vertex count ⇒ 4 KB - 1.8 GB grid sizes

SLIDE 25

HFBS is faster and more scalable than prior work.

[Plot: log runtime (ms), 0.01 to 10000, versus log bilateral grid vertices, 1,000 to 10,000,000; series: Prior Work (CPU), CPU, GPU, FPGA]

SLIDE 26

HFBS is faster and more scalable than prior work.

[Plot: log runtime (ms), 0.01 to 10000, versus log bilateral grid vertices, 1,000 to 10,000,000; series: Prior Work (CPU), CPU, GPU, FPGA; annotated region: 30 FPS and better]

SLIDE 27

HFBS-FPGA is more power-efficient than other platforms

[Bar chart: power-efficiency relative to prior work (ops / watt improvement, axis ticks 10 to 40); bars: Prior Work, CPU, GPU, FPGA; value labels: 30.72x, 2.12x, 0.45x, 1.00x]

SLIDE 28

building a VR video camera rig with HFBS

SLIDE 29

[Figure labels: this work, full system]

SLIDE 30

HFBS-FPGA consumes much less power than a GPU for the same task

full system:
16 GPUs = 4,560 W
16 FPGAs = 400 W
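
The per-device power and the overall ratio follow directly from these two totals. A short sketch using only the slide's numbers (variable names are illustrative, and an even split across the 16 devices is assumed):

```python
# System totals quoted on the slide.
gpu_system_watts = 4560
fpga_system_watts = 400
devices = 16

# Assumes power is split evenly across the 16 devices.
watts_per_gpu = gpu_system_watts / devices    # 285.0
watts_per_fpga = fpga_system_watts / devices  # 25.0

# How many times more power the GPU system draws.
power_ratio = gpu_system_watts / fpga_system_watts

print(watts_per_gpu, watts_per_fpga, power_ratio)
```

By this arithmetic the 16-GPU system draws a little over eleven times the power of the 16-FPGA system.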

SLIDE 31

HFBS makes real-time VR video more feasible with FPGAs

sensor → download → pre-processing → alignment → optical flow → compositing → to viewer

offloaded to cloud
on-node with FPGAs

SLIDE 32

to conclude

fast, parallel implementation of bilateral solving with little accuracy loss
fixed-point datatypes and a custom bilateral-grid memory layout for improved FPGA performance
hardware-software codesign to reduce latency and improve quality for future VR applications

SLIDE 33

A Hardware-Friendly Bilateral Solver for Real-Time Virtual Reality Video

parallel algorithm for bilateral solving
FPGA architecture
50x faster, 30x more power-efficient