FPGA Acceleration for Computational Glass-Free Displays
Zhuolun He and Guojie Luo, Peking University
FPGA, Feb. 2017
Motivation: hyperopia/myopia

Background Technology: Glass-Free Display
– Light-field display [Huang and Wetzstein, SIGGRAPH 2014]
– Display: predistorted content
– Retina: desired image
[Figure: target light field → display → retina → desired perception]
Source: NVIDIA, SIGGRAPH Asia 2013
– Pinhole mask: one 75 µm pinhole in every 390 µm, manufactured using lithography
[Figure: the desired perception on the retina is the displayed light field multiplied by a projection matrix]

Problem formulation:
minimize g(y) = ∥v − Qy∥²  subject to 0 ≤ y ≤ 1
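For concreteness, the objective and its gradient can be written down directly (a dense NumPy sketch with illustrative names; the actual Q is sparse, and the box constraint is handled by the solver, not here):

```python
import numpy as np

def g_and_grad(Q, y, v):
    """Objective g(y) = ||v - Q y||^2 and gradient grad g(y) = 2 Q^T (Q y - v).
    Dense sketch only; the constraint 0 <= y <= 1 is enforced elsewhere
    (e.g., by projecting iterates onto the box)."""
    r = Q @ y - v                     # residual Q y - v
    return float(r @ r), 2.0 * (Q.T @ r)
```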
L-BFGS iteration (repeat until converged):
1. Calculate gradient ∇g(y_l)
2. Calculate direction q_l
3. Search for step length β_l
4. Update y_{l+1} = y_l + β_l q_l
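This loop can be sketched as follows, with plain steepest descent standing in for the L-BFGS direction and backtracking for the step-length search (function name and tolerances are illustrative, not from the talk):

```python
import numpy as np

def solve_display(Q, v, iters=200):
    """Iterate: gradient -> direction -> step length -> update, until converged.
    Steepest descent stands in for L-BFGS here; iterates are clipped to the
    box 0 <= y <= 1 from the problem formulation."""
    y = np.zeros(Q.shape[1])
    for _ in range(iters):
        r = Q @ y - v
        grad = 2.0 * (Q.T @ r)              # calculate gradient
        if np.linalg.norm(grad) < 1e-9:     # converged?
            break
        q = -grad                           # calculate direction (L-BFGS in the talk)
        g0, gg, beta = float(r @ r), float(grad @ grad), 1.0
        while True:                         # backtracking search for step length
            y_new = np.clip(y + beta * q, 0.0, 1.0)
            r_new = Q @ y_new - v
            if float(r_new @ r_new) <= g0 - 1e-4 * beta * gg or beta < 1e-12:
                break
            beta *= 0.5
        y = y_new                           # update y_{l+1} = y_l + beta_l q_l
    return y
```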
Two-loop recursion (history size = m)
– Input:
  ▫ t_k = y_{k+1} − y_k
  ▫ z_k = ∇g(y_{k+1}) − ∇g(y_k)
– Output: direction q_l
– Dominant operations: dot products and vector updates
Given history y_{l−n+1}, …, y_l and gradients ∇g(y_{l−n+1}), …, ∇g(y_l):

q_l = −∇g(y_l)
for j = l−1 downto l−n do              # first loop: more work
    β_j = (t_j · q_l) / (t_j · z_j)
    q_l = q_l − β_j z_j
end for
q_l = q_l · (t_{l−1} · z_{l−1}) / (z_{l−1} · z_{l−1})
for j = l−n to l−1 do                  # second loop: some work
    γ_j = (z_j · q_l) / (t_j · z_j)
    q_l = q_l + (β_j − γ_j) t_j
end for
return direction q_l
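A direct transcription of the two-loop recursion (NumPy sketch; histories stored oldest-first, names illustrative):

```python
import numpy as np

def two_loop_recursion(grad_l, t_hist, z_hist):
    """L-BFGS two-loop recursion in the slide's notation:
    t_k = y_{k+1} - y_k, z_k = grad g(y_{k+1}) - grad g(y_k);
    t_hist/z_hist hold the last n pairs, oldest first.
    Every line is a dot product or a full-length vector update."""
    n = len(t_hist)
    q = -grad_l.astype(float)
    betas = np.zeros(n)
    for j in range(n - 1, -1, -1):                # j = l-1 down to l-n
        betas[j] = (t_hist[j] @ q) / (t_hist[j] @ z_hist[j])
        q -= betas[j] * z_hist[j]
    # initial Hessian scaling from the newest pair
    q *= (t_hist[-1] @ z_hist[-1]) / (z_hist[-1] @ z_hist[-1])
    for j in range(n):                            # j = l-n up to l-1
        gamma = (z_hist[j] @ q) / (t_hist[j] @ z_hist[j])
        q += (betas[j] - gamma) * t_hist[j]
    return q                                      # direction q_l
```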
Vector-free L-BFGS [NIPS 2014]
– q_l is a linear combination of some basis in {t_k} and {z_k}
– dot product ⇒ lookup + scalar op.
– vector update ⇒ coefficient update
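A sketch of the idea (my own NumPy illustration, not the paper's code): keep a (2m+1)×(2m+1) table of pairwise dot products among {t_k}, {z_k}, and ∇g(y_l), and run the recursion on the coefficient vector of q_l in that basis, so no length-d vector is touched until the end:

```python
import numpy as np

def vector_free_direction(basis, gram):
    """basis = [t_0..t_{m-1}, z_0..z_{m-1}, grad] (2m+1 vectors, used only
    at the end); gram[i, j] = basis[i] . basis[j] is the dot-product table.
    The loops cost O(m^2) scalar ops instead of O(m d) vector ops."""
    n = len(basis)                        # n = 2m + 1
    m = (n - 1) // 2
    delta = np.zeros(n)                   # q_l = sum_i delta[i] * basis[i]
    delta[n - 1] = -1.0                   # q = -grad g(y_l)
    betas = np.zeros(m)
    for j in range(m - 1, -1, -1):
        t_dot_q = delta @ gram[j]         # dot prod => lookup + scalar ops
        betas[j] = t_dot_q / gram[j, m + j]
        delta[m + j] -= betas[j]          # vector update => coeff. update
    delta *= gram[m - 1, 2 * m - 1] / gram[2 * m - 1, 2 * m - 1]
    for j in range(m):
        gamma = (delta @ gram[m + j]) / gram[j, m + j]
        delta[j] += betas[j] - gamma
    return sum(c * b for c, b in zip(delta, basis))   # materialize q_l once
```

The result matches the classic two-loop recursion exactly; only the intermediate representation changes, which is what makes the memory traffic FPGA-friendly.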
[Figure: dataflow comparison from ∇g(y_l) to q_l. Original L-BFGS: repeated dotprod(), scalar-vector mult., and vector-vector add. on full-length vectors. Vector-free L-BFGS: a dot-product table plus scalar updates on coefficients, materializing q_l once at the end.]
Comparison (m: history size, e.g., 10; d: image size):

             Scenario                                    Focus                     Data transfer
[NIPS 2014]  Distributed computing using MapReduce       minimize #syncs           8md
Ours         FPGA acceleration with small on-chip BRAM   minimize data transfers   (4m+4)d
Sparse matrix compression
– Sparse matrix Q: 16384 × 490000
  ▫ ~810K non-zero entries
  ▫ ~600 unique values
– Variable y: 490000
– Sparsity ⇒ compressed row storage (CRS)
– Range of indices ⇒ bitwidth reduction
– #unique values ⇒ look-up table (LUT)
Format     Storage (MB)
flat       32112.64
COO        6.63
CRS        5.24
CRS+LUT    2.90
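The CRS+LUT scheme can be sketched in plain Python (illustrative code; the real design also narrows the index bitwidths to the ranges actually needed, which Python does not model):

```python
import numpy as np

def build_crs_lut(rows):
    """Compress a sparse matrix, given as per-row (col, value) lists, into
    CRS where values are stored as indices into a look-up table of unique
    values (the slides report ~600 unique values, so a small index replaces
    each 4-byte value)."""
    lut = sorted({v for row in rows for (_, v) in row})
    vid = {v: i for i, v in enumerate(lut)}
    row_ptr, col_idx, val_idx = [0], [], []
    for row in rows:
        for c, v in row:
            col_idx.append(c)
            val_idx.append(vid[v])
        row_ptr.append(len(col_idx))
    return np.array(row_ptr), np.array(col_idx), np.array(val_idx), np.array(lut)

def spmv(row_ptr, col_idx, val_idx, lut, y):
    """Q y using the compressed form: one LUT lookup per nonzero."""
    out = np.zeros(len(row_ptr) - 1)
    for r in range(len(out)):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            out[r] += lut[val_idx[k]] * y[col_idx[k]]
    return out
```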
Memory partitioning for non-affine access
– Matrix Q is irregular but constant
– ⇒ access pattern is non-affine but statistically analyzable
– ⇒ enumerate factors of |y| as partitioning factors
Factor   Method   Min cycle/row   Max cycle/row   Total cycles
980      cyclic   1               1               16384
1225     cyclic   1               1               16384
1250     cyclic   1               2               19840
…        …        …               …               …
1400     block    4               18              188564
1250     block    5               18              193276
…        …        …               …               …
1        N/A      37              54              816272
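The enumeration can be mimicked with a small model (my own sketch; the column pattern, d, and factors below are toy values, not the paper's): for each candidate factor, map every nonzero column of a row to its bank and count the worst per-bank collision, which lower-bounds the cycles for that row.

```python
import math
from collections import Counter

def cycles_per_row(rows, d, num_banks, scheme="cyclic"):
    """rows: per-row nonzero column indices of Q.  y is split into
    num_banks BRAM banks, cyclically (y[c] -> bank c % num_banks) or in
    blocks (y[c] -> bank c // ceil(d / num_banks)).  A row needs at least
    as many cycles as its most-contended bank has accesses."""
    block = math.ceil(d / num_banks)
    out = []
    for cols in rows:
        banks = [c % num_banks if scheme == "cyclic" else c // block for c in cols]
        out.append(max(Counter(banks).values(), default=1))
    return out

def total_cycles(rows, d, num_banks, scheme="cyclic"):
    """Objective used to rank candidate partitioning factors."""
    return sum(cycles_per_row(rows, d, num_banks, scheme))
```

Since Q is constant, this cost can be computed offline for every factor of |y| and the best factor/scheme pair baked into the hardware.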
Runtime Comparison
[Chart: total runtime drops from 124.5 s (baseline) through 65.49 s, 47.47 s, and 25.26 s (SpMV optimization and L-BFGS enhancement) to 9.74 s after parameter tuning in L-BFGS and other fine tunings: a 12.78× overall speedup]
Conclusions
– Bandwidth-friendly L-BFGS algorithm
– Application-specific sparse matrix compression
– Memory partitioning for non-affine access

Future work
– Possibility of real-time processing
– Construct transformation matrix by eyeball tracking
– A demonstrative system