FPGA Acceleration for Computational Glass-Free Displays, Zhuolun He and Guojie Luo, Peking University (PowerPoint PPT presentation)



SLIDE 1

FPGA Acceleration for Computational Glass-Free Displays

Zhuolun He and Guojie Luo Peking University FPGA, Feb. 2017

SLIDE 2

Motivation: hyperopia/myopia Issues


SLIDE 3

Background Technology: Glass-Free Display

  • Light-field display

– [Huang and Wetzstein, SIGGRAPH 2014]

  • Correcting for visual aberrations

– Display: predistorted content
– Retina: desired image

[Figure: target light field → display → retina → desired perception]

SLIDE 4

Related Technologies: Light Field Camera


SLIDE 5

Related: Near-eye Light-field Display


Source: NVIDIA, SIGGRAPH Asia 2013

SLIDE 6

Pinhole Array vs. Microlens


One 75 µm pinhole in every 390 µm, manufactured using lithography

SLIDE 7

In this Paper…

  • Analyze the computational kernels
  • Accelerate using FPGAs
  • Propose several optimizations


SLIDE 8

Computational Glass-Free Display

[Figure: the desired perception on the retina is modeled as a linear projection of the displayed light field]

SLIDE 9

Casting as a Model Fitting Problem

minimize g(y) = ‖v − Qy‖²
subject to 0 ≤ y ≤ 1

[Figure: desired retinal image v as the projection Q of the display content y, as on the previous slide]

SLIDE 10

Background of the L-BFGS Algorithm

  • L-BFGS: a widely used quasi-Newton algorithm for convex optimization

[Flowchart: one L-BFGS iteration]
1. Calculate gradient ∇g(y_l)
2. Calculate direction q_l
3. Search for step length β_l
4. Update y_{l+1} = y_l + β_l q_l
5. Converged? If no, repeat; if yes, done.

SLIDE 11

Background of the L-BFGS Algorithm

  • L-BFGS algorithm

– Input (history size = m):

▫ t_k = y_{k+1} − y_k
▫ z_k = ∇g(y_{k+1}) − ∇g(y_k)

– Output: direction q_l

  • Computational kernels

– dot products
– vector updates

Stored history: y_{l−m+1} … y_l and ∇g(y_{l−m+1}) … ∇g(y_l)

q_l = −∇g(y_l)
for j = l−1 downto l−m do          # more work
    β_j = (t_j · q_l) / (t_j · z_j)
    q_l = q_l − β_j z_j
end for
q_l = q_l · (t_{l−1} · z_{l−1}) / (z_{l−1} · z_{l−1})
for j = l−m to l−1 do              # some work
    γ_j = (z_j · q_l) / (t_j · z_j)
    q_l = q_l + (β_j − γ_j) t_j
end for
return direction q_l
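The two-loop recursion above can be sketched in NumPy. This is a minimal illustration of the slide's pseudocode, not the paper's FPGA implementation; the function and variable names are my own:

```python
import numpy as np

def two_loop_direction(grad, t_hist, z_hist):
    """Classic L-BFGS two-loop recursion (sketch).

    grad   : gradient at the current iterate, shape (d,)
    t_hist : list of t_k = y_{k+1} - y_k, oldest first
    z_hist : list of z_k = grad(y_{k+1}) - grad(y_k), oldest first
    Returns the search direction q.
    """
    q = -grad.copy()
    betas = []
    # First loop: newest to oldest history pair.
    for t, z in zip(reversed(t_hist), reversed(z_hist)):
        beta = np.dot(t, q) / np.dot(t, z)
        betas.append(beta)
        q -= beta * z
    # Scale by an initial Hessian approximation.
    t_last, z_last = t_hist[-1], z_hist[-1]
    q *= np.dot(t_last, z_last) / np.dot(z_last, z_last)
    # Second loop: oldest to newest, reusing the stored betas.
    for (t, z), beta in zip(zip(t_hist, z_hist), reversed(betas)):
        gamma = np.dot(z, q) / np.dot(t, z)
        q += (beta - gamma) * t
    return q
```

Each history pair costs two dot products and two vector updates over the full image-sized vector, which is exactly the bandwidth problem the following slides attack.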

SLIDE 12

Vector-free L-BFGS Algorithm

  • Original idea

– [NIPS 2014]

  • Observation

– q_l is a linear combination of a basis drawn from {t_k} and {z_k}

  • Techniques

– dot product ⇒ lookup + scalar op.
– vector update ⇒ coeff. update

[Two-loop recursion pseudocode as on the previous slide, with the dot products and vector updates marked as the dominant work]
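To make the coefficient-update idea concrete, here is a minimal vector-free variant of the recursion, a sketch after the [NIPS 2014] formulation with my own indexing and names. The long vectors are touched only to build the dot-product table and to reconstruct the result; both loops run entirely on scalars:

```python
import numpy as np

def vector_free_direction(t_list, z_list, grad):
    """Vector-free two-loop recursion (sketch).

    t_list, z_list : m history pairs, oldest first
    grad           : current gradient
    The basis is b_0..b_{m-1} = t's, b_m..b_{2m-1} = z's, b_{2m} = grad.
    """
    m = len(t_list)
    b = t_list + z_list + [grad]
    # Dot-product table: the only place full-length vectors are read.
    D = np.array([[np.dot(x, y) for y in b] for x in b])

    delta = np.zeros(2 * m + 1)       # coefficients of q over the basis
    delta[2 * m] = -1.0               # q = -grad
    dot = lambda i: D[i] @ delta      # b_i . q via the table

    betas = []
    for j in reversed(range(m)):      # first loop, newest to oldest
        beta = dot(j) / D[j, m + j]   # (t_j.q)/(t_j.z_j)
        betas.append(beta)
        delta[m + j] -= beta          # q -= beta * z_j, as a coeff. update
    delta *= D[m - 1, 2 * m - 1] / D[2 * m - 1, 2 * m - 1]
    for j, beta in zip(range(m), reversed(betas)):
        gamma = dot(m + j) / D[j, m + j]
        delta[j] += beta - gamma      # q += (beta-gamma) * t_j
    # Reconstruct the direction once at the end.
    return sum(c * v for c, v in zip(delta, b))
```

The table has only (2m+1)² scalars, so for small m it fits comfortably in on-chip BRAM.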

SLIDE 13

[Diagram: Original L-BFGS computes q_l from ∇g(y_l) via dotprod(), scalar-vector multiplies, and vector-vector adds on full-length vectors; vector-free L-BFGS replaces these with scalar operations on coefficient sets plus a dot-product table]

SLIDE 14

Updating the Dot Product Table

  • Similar idea to reduce data transfers

– dot product ⇒ lookup + scalar op.
– vector update ⇒ coeff. update

Scenario      Setting                                      Focus
[NIPS 2014]   Distributed computing using MapReduce        minimize #syncs
Ours          FPGA acceleration with small on-chip BRAM    minimize data transfers

SLIDE 15

Distributed vs. FPGA-based

– m: history size (e.g., 10)
– d: image size

Scenario      Setting                                      Focus                      Data transfer
[NIPS 2014]   Distributed computing using MapReduce        minimize #syncs            8md
Ours          FPGA acceleration with small on-chip BRAM    minimize data transfers    (4m+4)d
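Plugging in the example values from these slides (m = 10, d = 490,000) gives a quick illustrative check of the two transfer counts; this is arithmetic on the quoted formulas, not a measurement:

```python
# Data transferred per L-BFGS direction computation, in vector elements,
# using the counts quoted in the table above.
m = 10        # history size
d = 490_000   # image size

mapreduce = 8 * m * d        # [NIPS 2014] scheme: 8md
fpga = (4 * m + 4) * d       # this work: (4m+4)d

ratio = mapreduce / fpga     # 80/44, roughly a 1.8x reduction
```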

SLIDE 16

Sparse Matrix-Vector Multiplication

  • Size of matrix/vector

– Sparse matrix Q: 16384 × 490000
– Variable y: 490000

minimize g(y) = ‖v − Qy‖²

SLIDE 17

Sparse Matrix-Vector Multiplication

  • Problem: storage of Q
  • Solution:

– Sparsity ⇒ compressed row storage (CRS)
  ▫ ~810K non-zero entries
– Range of indices ⇒ bitwidth reduction
– #unique values ⇒ look-up table (LUT)
  ▫ ~600 unique values

minimize g(y) = ‖v − Qy‖²

Format    Storage (MB)
flat      32112.64
COO       6.63
CRS       5.24
CRS+LUT   2.90
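A minimal sketch of the CRS-plus-value-LUT idea: store each non-zero as a small index into a table of unique values rather than a full float. The function name, row layout, and dtypes are illustrative assumptions, not the paper's exact on-chip format:

```python
import numpy as np

def compress_crs_lut(rows):
    """Compress a sparse matrix given as a list of {column: value} dicts.

    Returns (lut, val_idx, col_idx, row_ptr): CRS arrays where val_idx
    holds small indices into lut instead of the values themselves.
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in rows:
        for col, val in sorted(row.items()):
            col_idx.append(col)
            values.append(val)
        row_ptr.append(len(col_idx))          # CRS row boundaries
    # With only ~600 distinct values, a 16-bit (or smaller) index
    # per entry suffices in place of a 32-bit float.
    lut, val_idx = np.unique(np.array(values), return_inverse=True)
    return lut, val_idx.astype(np.uint16), np.array(col_idx), np.array(row_ptr)
```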

SLIDE 18

Sparse Matrix-Vector Multiplication

  • Problem: partitioning vector y
  • “Solution”:

– Matrix Q is irregular but constant
– ⇒ access pattern is non-affine but statically analyzable
– ⇒ enumerate factors of |y| as partitioning factors

minimize g(y) = ‖v − Qy‖²

Factor   Method   Min cycles/row   Max cycles/row   Total cycles
980      cyclic   1                1                16384
1225     cyclic   1                1                16384
1250     cyclic   1                2                19840
…        …        …                …                …
1400     block    4                18               188564
1250     block    5                18               193276
…        …        …                …                …
1        N/A      37               54               816272
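Because Q is constant, the bank conflicts for any candidate partitioning can be counted exactly, offline. The sketch below shows that search step; the function name and the toy test values are my own, not the paper's tooling:

```python
from collections import Counter

def cycles_per_row(rows, n, factor, scheme="cyclic"):
    """Estimate per-row memory cycles for a partitioned vector of size n.

    rows   : for each matrix row, the column indices of y it reads
    factor : number of memory banks (partitioning factor)
    Cycles for a row = worst-case number of accesses to a single bank.
    """
    per_row = []
    for cols in rows:
        if scheme == "cyclic":
            banks = [c % factor for c in cols]
        else:  # block partitioning
            block = -(-n // factor)          # ceil(n / factor)
            banks = [c // block for c in cols]
        per_row.append(max(Counter(banks).values()))
    return min(per_row), max(per_row), sum(per_row)
```

Enumerating factors of |y| with such a counter is how a table like the one above can be filled in before synthesis.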

SLIDE 19

Overall Design of the Accelerator

  • Approach of [Li et al., FPGA 2015]
  • Maximize performance
  • Subject to resource constraints

SLIDE 20

Experimental Evaluation

Runtime Comparison

[Chart: runtime (s) after each optimization step]

Baseline                        124.5
+ SpMV optimization             65.49
+ L-BFGS enhancement            47.47
+ parameter tuning in L-BFGS    25.26
+ other fine tunings            9.74

Overall speedup: 12.78×

  • Peak memory bandwidth < 800 MB/s

SLIDE 21

Conclusions

  • Summary

– Bandwidth-friendly L-BFGS algorithm
– Application-specific sparse matrix compression
– Memory partitioning for non-affine accesses

  • Future work

– Possibility of real-time processing
– Constructing the transformation matrix via eyeball tracking
– A demonstrative system

SLIDE 22

Questions?


SLIDE 23

Runtime Profiling of a 2-min L-BFGS

[Charts: runtime breakdown per procedure and per operation]
