

  1. OpenCL-Based Erasure Coding on Heterogeneous Architectures. Guoyang Chen, Huiyang Zhou, Xipeng Shen (North Carolina State University); Josh Gahm, Narayan Venkat, Skip Booth, John Marshall (Cisco Systems, Inc.). Email: gchen11@ncsu.edu

  2. Introduction • A key challenge in storage systems: o Failures (disk sector, entire disk, storage site) • A solution: o Erasure coding • Intel's Intelligent Storage Acceleration Library (ISA-L)

  3. Motivation • Erasure coding schemes: o Replication (simple; high storage cost, low fault tolerance) o Reed-Solomon coding (lower storage cost, higher fault tolerance; computationally complex) o ... • Motivation: o Explore various heterogeneous architectures to accelerate Reed-Solomon coding.

  4. Reed-Solomon Coding • Block-based parity encoding: inputs are partitioned into 'srcs' blocks, each 'length' bytes. o Encode matrix V: dests × srcs, with dests > srcs o Dest = V × Src

  5. Reed-Solomon Coding • Block-based parity encoding: inputs are partitioned into 'srcs' blocks, each 'length' bytes. o Encode matrix V: dests × srcs, with dests > srcs o Dest = V × Src, i.e., $Dest[m][j] = \sum_{k=0}^{srcs-1} V[m][k] \times Src[k][j]$

  6. Reed-Solomon Coding • Block-based parity encoding: inputs are partitioned into 'srcs' blocks, each 'length' bytes. o Encode matrix V: dests × srcs, with dests > srcs o Dest = V × Src, i.e., $Dest[m][j] = \sum_{k=0}^{srcs-1} V[m][k] \times Src[k][j]$ • sum: 8-bit XOR; mul: GF(2^8) multiplication
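A minimal scalar C sketch of this encoding loop (the layout and names are ours, not the authors' code); gf_mul stands for the GF(2^8) multiplication discussed on the next slide:

    #include <stdint.h>

    uint8_t gf_mul(uint8_t a, uint8_t b);   /* GF(2^8) multiply, defined below */

    void rs_encode(const uint8_t *V,        /* dests x srcs  encode matrix   */
                   const uint8_t *Src,      /* srcs  x length input blocks   */
                   uint8_t *Dest,           /* dests x length output blocks  */
                   int srcs, int dests, int length)
    {
        for (int m = 0; m < dests; m++)          /* one output block per row  */
            for (int j = 0; j < length; j++) {   /* one byte per column       */
                uint8_t acc = 0;
                for (int k = 0; k < srcs; k++)   /* "sum" is XOR in GF(2^8)   */
                    acc ^= gf_mul(V[m * srcs + k], Src[k * length + j]);
                Dest[m * length + j] = acc;
            }
    }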

  7. GF(2^8) Multiplication • Three ways to perform Galois field multiplication: o Russian Peasant algorithm: pure logic operations. o 2 small tables: 256 bytes per table, 3 table lookups, 3 logic operations. o 1 large table: 256×256 bytes, one lookup, no logic operations. • Refer to the paper for details.
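As a concrete example, here is a hedged C sketch of the Russian Peasant option, using only shifts and XORs. The reduction polynomial 0x1D (x^8 + x^4 + x^3 + x^2 + 1) is our assumption; it is a common choice in erasure-coding libraries, but the paper may use a different one:

    #include <stdint.h>

    uint8_t gf_mul(uint8_t a, uint8_t b)
    {
        uint8_t p = 0;
        while (b) {
            if (b & 1)
                p ^= a;                 /* "add" is XOR in GF(2^8)            */
            uint8_t carry = a & 0x80;   /* would the shift overflow 8 bits?   */
            a <<= 1;
            if (carry)
                a ^= 0x1D;              /* reduce modulo the field polynomial */
            b >>= 1;
        }
        return p;
    }

The two-small-table variant replaces this loop with exp[(log[a] + log[b]) mod 255] (three lookups plus a little arithmetic), and the large table simply precomputes all 256×256 products.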

  8. Reed-Solomon Coding on CPUs • Intel ISA-L: o Single-threaded. o Serves as the baseline. • Adding multithreading support: o Partition the input matrix column-wise across threads, as sketched below.
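A hedged OpenMP sketch of the column-wise split (the authors add threading on top of ISA-L; this standalone version reuses the scalar gf_mul above only to show the partitioning). Each thread owns a contiguous byte range of every block, so no two threads touch the same output bytes. Compile with -fopenmp:

    #include <stdint.h>

    uint8_t gf_mul(uint8_t a, uint8_t b);   /* defined earlier */

    void rs_encode_mt(const uint8_t *V, const uint8_t *Src, uint8_t *Dest,
                      int srcs, int dests, int length, int nthreads)
    {
        #pragma omp parallel for num_threads(nthreads)
        for (int t = 0; t < nthreads; t++) {
            /* thread t encodes columns [begin, end) of every block */
            int begin = (int)((long long)length * t / nthreads);
            int end   = (int)((long long)length * (t + 1) / nthreads);
            for (int m = 0; m < dests; m++)
                for (int j = begin; j < end; j++) {
                    uint8_t acc = 0;
                    for (int k = 0; k < srcs; k++)
                        acc ^= gf_mul(V[m * srcs + k], Src[k * length + j]);
                    Dest[m * length + j] = acc;
                }
        }
    }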

  9. Reed-Solomon Coding on GPUs • The computation of each element of the output matrix is independent of all others. • Fine-grained parallelization: o One work-item per byte of the output matrix (the baseline; see the kernel sketch below). • Optimizations?
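A hedged OpenCL C sketch of that baseline: the kernel is launched over a length × dests 2-D range, and gf_mul_table is assumed to be the 64 KB "large table" from slide 7:

    __kernel void rs_encode_baseline(__global const uchar *V,
                                     __global const uchar *Src,
                                     __global uchar *Dest,
                                     __global const uchar *gf_mul_table,
                                     int srcs, int length)
    {
        int j = get_global_id(0);            /* byte within the block */
        int m = get_global_id(1);            /* output block (row)    */
        uchar acc = 0;
        for (int k = 0; k < srcs; k++) {
            uchar a = V[m * srcs + k];
            uchar b = Src[k * length + j];
            acc ^= gf_mul_table[(int)a * 256 + b];   /* one lookup, no logic */
        }
        Dest[m * length + j] = acc;
    }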

  10. Reed-Solomon Coding on GPUs, Opt (A) • A. Optimize GPU memory bandwidth: o Memory coalescing (work-items in one work-group access data in the same row). o Vectorization (read a uint4 at a time) ⇒ higher bandwidth. • Each work-item now handles 16 bytes of data, as sketched below.
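A hedged sketch of the vectorized kernel: the buffers are reinterpreted as uint4 so each work-item issues one 16-byte load per source block, and consecutive work-items produce coalesced accesses. The union-based byte access and the assumption that length is a multiple of 16 are ours:

    __kernel void rs_encode_vec4(__global const uchar *V,
                                 __global const uint4 *Src,   /* srcs  x length/16 */
                                 __global uint4 *Dest,        /* dests x length/16 */
                                 __global const uchar *gf_mul_table,
                                 int srcs, int vlen)          /* vlen = length/16  */
    {
        int j = get_global_id(0);
        int m = get_global_id(1);
        union { uint4 v; uchar b[16]; } acc, in;
        acc.v = (uint4)(0);
        for (int k = 0; k < srcs; k++) {
            uchar a = V[m * srcs + k];
            in.v = Src[k * vlen + j];         /* one coalesced 16-byte load */
            for (int i = 0; i < 16; i++)      /* GF-multiply each byte      */
                acc.b[i] ^= gf_mul_table[(int)a * 256 + in.b[i]];
        }
        Dest[m * vlen + j] = acc.v;
    }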

  11. Reed-Solomon Coding on GPUs, Opt (B) • B. Overcoming the memory bandwidth limit using texture caches and tiling: o Work-items in the same row share the same values of V ⇒ place the encode matrix and the large lookup table (64 KB, for GF(2^8) multiplication) in the texture cache. o Dest = V × Src

  12. Reed-Solomon Coding on GPUs, Opt (B) • B. Overcoming the memory bandwidth limit using texture caches and tiling: o Work-items in the same row share the same values of V ⇒ place the encode matrix and the large lookup table (64 KB, for GF(2^8) multiplication) in the texture cache (one possible mechanism is sketched below). o Src in the texture cache via tiling (as in matrix multiplication). • Not helpful: the bottleneck is computation, not bandwidth. o Dest = V × Src
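In OpenCL, small read-only shared data can be routed through a GPU's constant/texture cache via the __constant address space (image objects read with read_imageui are another route). This is our assumption about the mechanism, not the authors' exact code; note that 64 KB matches the minimum guaranteed constant-buffer size:

    __kernel void rs_encode_tex(__constant uchar *V,
                                __constant uchar *gf_mul_table,  /* 64 KB table   */
                                __global const uint4 *Src,
                                __global uint4 *Dest,
                                int srcs, int vlen)              /* length/16     */
    {
        int j = get_global_id(0);
        int m = get_global_id(1);
        union { uint4 v; uchar b[16]; } acc, in;
        acc.v = (uint4)(0);
        for (int k = 0; k < srcs; k++) {
            uchar a = V[m * srcs + k];   /* served by the constant/texture cache */
            in.v = Src[k * vlen + j];
            for (int i = 0; i < 16; i++)
                acc.b[i] ^= gf_mul_table[(int)a * 256 + in.b[i]];
        }
        Dest[m * vlen + j] = acc.v;
    }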

  13. Reed-Solomon Coding on GPUs, Opt (C) • C. Hiding data-transfer latency over PCIe: o Partition the input into multiple groups, with one stream per group. o Overlap data-copy time with computation time (host-side sketch below). [Diagram: streams 1 through N each run H2D copy → compute → D2H copy, staggered so one stream's transfers overlap another stream's computation]
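A hedged host-side sketch using one in-order OpenCL command queue per "stream" (OpenCL has no CUDA-style streams, so independent queues play that role). Every enqueue is non-blocking, letting the runtime overlap one queue's copies with another queue's kernel; error checking and the remaining kernel arguments (V, the lookup table, sizes) are omitted:

    #include <CL/cl.h>
    #include <stdint.h>

    #define NSTREAMS 8

    void encode_streamed(cl_context ctx, cl_device_id dev, cl_kernel kern,
                         cl_mem *src_bufs, cl_mem *dst_bufs,
                         const uint8_t *src, uint8_t *dst,
                         size_t group_bytes, size_t out_bytes)
    {
        cl_command_queue q[NSTREAMS];
        for (int s = 0; s < NSTREAMS; s++)
            q[s] = clCreateCommandQueue(ctx, dev, 0, NULL);

        size_t gws[1] = { group_bytes };  /* simplified 1-D launch */
        for (int s = 0; s < NSTREAMS; s++) {
            /* H2D copy for this group (non-blocking) */
            clEnqueueWriteBuffer(q[s], src_bufs[s], CL_FALSE, 0, group_bytes,
                                 src + s * group_bytes, 0, NULL, NULL);
            /* args are snapshotted at enqueue time, so the kernel is reusable */
            clSetKernelArg(kern, 0, sizeof(cl_mem), &src_bufs[s]);
            clSetKernelArg(kern, 1, sizeof(cl_mem), &dst_bufs[s]);
            clEnqueueNDRangeKernel(q[s], kern, 1, NULL, gws, NULL, 0, NULL, NULL);
            /* D2H copy of this group's parity (non-blocking) */
            clEnqueueReadBuffer(q[s], dst_bufs[s], CL_FALSE, 0, out_bytes,
                                dst + s * out_bytes, 0, NULL, NULL);
        }
        for (int s = 0; s < NSTREAMS; s++)
            clFinish(q[s]);
    }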

  14. Reed-Solomon Coding on GPUs, Opt (D) • D. Shared virtual memory (SVM) to eliminate memory copies: o SVM is supported in OpenCL 2.0 (e.g., on AMD APUs). o No data copies are needed (sketch below).
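A hedged coarse-grained SVM sketch: the kernel dereferences the host allocation directly, so clEnqueueWriteBuffer/ReadBuffer disappear. The blocking map/unmap calls shown are the ones whose overhead the Carrizo results slide points to; names and argument choices are ours:

    #include <CL/cl.h>
    #include <stdint.h>

    void encode_svm(cl_context ctx, cl_command_queue q, cl_kernel kern,
                    size_t srcs, size_t dests, size_t length)
    {
        size_t gws[2] = { length, dests };
        uint8_t *src = clSVMAlloc(ctx, CL_MEM_READ_ONLY,  srcs  * length, 0);
        uint8_t *dst = clSVMAlloc(ctx, CL_MEM_WRITE_ONLY, dests * length, 0);

        /* blocking map before the host writes the buffer */
        clEnqueueSVMMap(q, CL_TRUE, CL_MAP_WRITE, src, srcs * length, 0, NULL, NULL);
        /* ... fill src with the input blocks here ... */
        clEnqueueSVMUnmap(q, src, 0, NULL, NULL);

        clSetKernelArgSVMPointer(kern, 0, src);   /* no clEnqueueWriteBuffer */
        clSetKernelArgSVMPointer(kern, 1, dst);
        clEnqueueNDRangeKernel(q, kern, 2, NULL, gws, NULL, 0, NULL, NULL);

        /* blocking map before the host reads the result */
        clEnqueueSVMMap(q, CL_TRUE, CL_MAP_READ, dst, dests * length, 0, NULL, NULL);
        /* ... consume the encoded blocks in dst ... */
        clEnqueueSVMUnmap(q, dst, 0, NULL, NULL);

        clSVMFree(ctx, src);
        clSVMFree(ctx, dst);
    }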

  15. Reed-Solomon Coding on FPGAs • FPGAs: o Abundant on-chip logic for computation. o Pipelined parallelism instead of the GPU's data parallelism. o Relatively low memory-access bandwidth. • Reed-Solomon coding: o Compute-bound. o A good candidate for FPGAs. o Same baseline code as on the GPUs (one work-item per byte).

  16. Reed-Solomon Coding on FPGAs, Opt (A) • A. Vectorization to optimize FPGA memory bandwidth: o Each work-item reads 64 bytes from the input, as sketched below.
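This is the earlier uint4 kernel widened to uint16 (16 × 4 bytes = 64 bytes per load), which matches the "int16" variants in the results slides; length is assumed to be a multiple of 64. A hedged sketch:

    __kernel void rs_encode_vec16(__global const uchar *V,
                                  __global const uint16 *Src,
                                  __global uint16 *Dest,
                                  __global const uchar *gf_mul_table,
                                  int srcs, int vlen)   /* vlen = length/64 */
    {
        int j = get_global_id(0);
        int m = get_global_id(1);
        union { uint16 v; uchar b[64]; } acc, in;
        acc.v = (uint16)(0);
        for (int k = 0; k < srcs; k++) {
            uchar a = V[m * srcs + k];
            in.v = Src[k * vlen + j];          /* one 64-byte load */
            for (int i = 0; i < 64; i++)
                acc.b[i] ^= gf_mul_table[(int)a * 256 + in.b[i]];
        }
        Dest[m * vlen + j] = acc.v;
    }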

  17. Reed-Solomon Coding on FPGAs, Opt (B) • B. Overcoming the memory bandwidth limit using tiling: o Load a tile of the input matrix into local memory shared by the work-group (sketch below). o A larger tile yields more data reuse and reduces off-chip memory bandwidth demand.
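A hedged sketch of the tiling idea: each work-group stages a TILE-column slice of all source blocks into on-chip local memory once, then reuses it for every output row, cutting off-chip reads of Src by roughly a factor of dests. TILE, the 32-block cap (the experiments use srcs up to 30), and the scalar table lookups are our choices:

    #define TILE 64   /* columns per work-group; also the work-group size */

    __kernel void rs_encode_tiled(__global const uchar *V,
                                  __global const uchar *Src,
                                  __global uchar *Dest,
                                  __global const uchar *gf_mul_table,
                                  int srcs, int dests, int length)
    {
        __local uchar tile[32][TILE];        /* assumes srcs <= 32          */
        int j0  = get_group_id(0) * TILE;    /* first column of this tile   */
        int lid = get_local_id(0);

        for (int k = 0; k < srcs; k++)       /* cooperative tile load       */
            tile[k][lid] = Src[k * length + j0 + lid];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int m = 0; m < dests; m++) {    /* reuse the tile dests times  */
            uchar acc = 0;
            for (int k = 0; k < srcs; k++)
                acc ^= gf_mul_table[(int)V[m * srcs + k] * 256 + tile[k][lid]];
            Dest[m * length + j0 + lid] = acc;
        }
    }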

  18. Reed-Solomon Coding on FPGAs, Opt (C) • C. Loop unrolling and kernel replication to fully utilize the FPGA's logic resources: o __attribute__((num_compute_units(n))): n replicated pipelines. o Loop unrolling: a deeper pipeline. (Sketch below.)
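Both knobs from this slide in one hedged sketch, using the Intel/Altera FPGA OpenCL syntax: num_compute_units replicates the whole kernel pipeline, and #pragma unroll deepens the pipeline within it. The replication factor 4 and unroll factor 8 are illustrative, trading logic area for throughput:

    __attribute__((num_compute_units(4)))    /* 4 replicated pipelines */
    __kernel void rs_encode_fpga(__global const uchar *V,
                                 __global const uchar *Src,
                                 __global uchar *Dest,
                                 __global const uchar *gf_mul_table,
                                 int srcs, int length)
    {
        int j = get_global_id(0);
        int m = get_global_id(1);
        uchar acc = 0;
        #pragma unroll 8                 /* 8 GF multiply-XOR stages in flight */
        for (int k = 0; k < srcs; k++)
            acc ^= gf_mul_table[(int)V[m * srcs + k] * 256 + Src[k * length + j]];
        Dest[m * length + j] = acc;
    }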

  19. Experiments • Input: an 836.9 MB file. • CPU: Intel Xeon E5-2697 v3 (28 cores). • GPUs: NVIDIA K40m (CUDA 7.0); AMD Carrizo APU. • FPGA: Altera Stratix V A7.

  20. On CPU • srcs = 30, dests = 33. [Figure: encode bandwidth (GB/s) vs. number of threads (0 to 120); bandwidth peaks at about 2.84 GB/s around 56 threads]

  21. On NVIDIA K40m • One stream: the large-table version is best (2.15 GB/s). • 8 streams: about 3.9 GB/s. [Figure: encode bandwidth]

  22. On AMD Carrizo • SVM is not as good as streaming, and the texture cache does not work well: o Overhead of the blocking functions that map and unmap SVM buffers. [Figure: encode bandwidth (GB/s, 0 to 0.6) for char/int/int4 variants under SVM vs. streaming]

  23. On FPGA • DMA read/write runs at about 3 GB/s, so we focus on kernel throughput only, assuming the DMA engine's bandwidth can easily be increased. [Figure: encode bandwidth (GB/s, log scale from 0.001 to 10) for char/int/int16 variants, with tiling and unrolling, under the large-table, small-table, and Russian Peasant GF multiplications]

  24. Overall • Considering price, the FPGA platform is the most promising, but its current PCIe DMA interface needs improvement. [Figure: encode bandwidth (GB/s, 0 to 8) vs. srcs (10 to 30), with dests = srcs + 3, for GPU, FPGA, multi-core CPU (MC-CPU), and single-thread CPU (ST-CPU)]

  25. New Update: Kernel + Memory Copy between Host and Device [Figure: encode bandwidth (GB/s, 0 to 7) for file1 and file2 on BDW+SVM, BDW, Arria 10, and Stratix V] o file1 is 29 MB; file2 is 438 MB. o BDW: an FPGA (Arria 10) integrated with a Xeon core. o SVM (shared virtual memory): the map/unmap overhead is included. o Arria 10: a discrete FPGA board attached over PCIe. o Stratix V: a discrete FPGA board attached over PCIe.

  26. Conclusions • Explored different computing devices for erasure coding. • Different devices call for different optimizations. • The FPGA is the most promising device for erasure coding.
