Energy-Efficient Stochastic Matrix Function Estimator for Graph Analytics on FPGA


  1. Energy-Efficient Stochastic Matrix Function Estimator for Graph Analytics on FPGA
     Heiner Giefers, Peter Staar, Raphael Polig
     IBM Research – Zurich
     26th International Conference on Field-Programmable Logic and Applications
     29th August – 2nd September 2016, SwissTech Convention Centre, Lausanne, Switzerland

  2. Motivation
     • Knowledge graphs appear in many areas of basic research
     • These knowledge graphs can become very big (e.g. cover around ~80M papers and 10M patents)
     • We want to extract hidden correlations in these graphs
     [Figure: System-Biology Knowledge Graph with node counts – Pubmed (644890), Authors (1869746), Journals (9052), Proteins (549832), Diseases (9100), Drugs (8148), Symptoms (1433), MeSH (35158)]

  3. Graph Analytics Use Cases
     To extract hidden correlations in these graphs, we need to apply advanced graph algorithms. Examples are:
     1. Subgraph centralities: find the most relevant nodes by ranking them according to the number of closed walks
     2. Spectral methods: compare large graphs by looking at their spectrum

  4. Graph Analytics Use Cases
     To extract hidden correlations in these graphs, we need to apply advanced graph algorithms. Examples are:
     1. Subgraph centralities: find the most relevant nodes by ranking them according to the number of closed walks. This requires us to diagonalize the adjacency matrix of the graph, which has a complexity of O(N³); a graph of 1M nodes requires exascale computing.
     2. Spectral methods: compare large graphs by looking at their spectrum

  5. Node Centrality for Ranking Nodes in a Graph
     • Subgraph centrality: the total number of closed walks in the network
     • The number of walks of length l in A from node u to node v is (A^l)_uv
     • Subgraph centrality considers all possible walks; shorter walks have higher importance:
       I + A + A²/2! + A³/3! + A⁴/4! + A⁵/5! + ⋯ = e^A
     • Taylor series for the exponential function e^A → weighted sum of all paths in A
     • Consider only closed walks → c_i = Diag(e^A)_i = (e^A)_ii
     • Explicit computation of matrix exponentials is difficult
     • Though A is sparse, A^l becomes dense → huge memory footprint
     • Exascale compute requirements for exact solutions
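     To make the definition concrete, here is a minimal Python sketch (my own, not from the slides) of the exact computation for a tiny, hypothetical 4-node graph: the subgraph centrality of node i is the i-th diagonal entry of the matrix exponential of the adjacency matrix. This is the kind of O(N³)-style dense computation that the SME framework on the following slides avoids.

         # Sketch only: exact subgraph centrality c_i = [exp(A)]_ii for a tiny
         # hypothetical 4-node graph (infeasible for the large graphs on slide 2).
         import numpy as np
         from scipy.linalg import expm

         A = np.array([[0, 1, 1, 0],
                       [1, 0, 1, 1],
                       [1, 1, 0, 0],
                       [0, 1, 0, 0]], dtype=float)   # adjacency matrix (undirected)

         centrality = np.diag(expm(A))               # c_i = [e^A]_ii
         ranking = np.argsort(-centrality)           # most central node first
         print(centrality)
         print(ranking)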

  6. Observations
     • Observation 1: We only need an approximate solution
       • We do not need highly accurate results to obtain a good ranking!
       • We do not need to know the exact values of the eigenvalues in order to have a histogram of the spectrum of A!
     • Observation 2: In both operations, we need to compute only a subset of elements of a matrix functional
       • In the case of the subgraph centrality, we need the diagonal of e^A
       • In the case of the spectrogram, we need to compute the trace of multiple step functions

  7. Stochastic Matrix-Function Estimator (SME)
     Framework to approximate (a subset of elements of) the matrix function f(A), where f is an arbitrary function and A is the adjacency matrix of the graph [1].

         R = zero();
         for l = 1 to Ns/Nb do                   // use Ns test vectors in blocks of size Nb
             forall e in V do                    // initialize the Nb columns of V with random -1/+1 (2% of run time)
                 e = (rand()/RAND_MAX < 0.5) ? -1.0 : 1.0;
             done
             M0 = V
             W  = c[0] * V            // AXPY  -- compute W = f(A) V with Chebyshev
             M1 = A * V               // SPMM  -- polynomials of the first kind
             W  = c[1] * M1 + W       // AXPY  -- (97% of run time)
             for m = 2 to Nc do
                 M0 = 2 * A * M1 - M0 // SPMM
                 W  = c[m] * M0 + W   // AXPY
                 pointer_swap(M0, M1)
             done
             R += W * V^T             // SGEMM / DOT -- accumulate partial results over test vectors (1%)
         done
         E[f(A)] = R / Ns             // normalize to get the final result

     [1] Peter W. J. Staar, Panagiotis Kl. Barkoutsos, Roxana Istrate, A. Cristiano I. Malossi, Ivano Tavernelli, Nikolaj Moll, Heiner Giefers, Christoph Hagleitner, Costas Bekas, and Alessandro Curioni. "Stochastic Matrix-Function Estimators: Scalable Big-Data Kernels with High Performance." IPDPS 2016 (received Best Paper Award).
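     Below is a compact numpy sketch of the same loop (my own illustration, not the authors' implementation). It assumes the spectrum of A has been scaled into [-1, 1] and uses the Chebyshev coefficients of exp(x) on that interval (c_0 = I_0(1), c_m = 2 I_m(1), with modified Bessel functions I_m); the values of N, Ns, Nb, and Nc are arbitrary choices for the sketch.

         # Minimal numpy sketch of the SME pseudocode above (not the authors' code).
         # Estimates diag(exp(A)) for a random sparse symmetric A whose spectrum is
         # scaled into [-1, 1] so the Chebyshev expansion of exp(x) applies.
         import numpy as np
         from scipy.special import iv        # modified Bessel functions I_m
         from scipy.linalg import expm

         rng = np.random.default_rng(0)
         N, Ns, Nb, Nc = 200, 512, 32, 12    # problem size and SME parameters (arbitrary)

         A = (rng.random((N, N)) < 0.02).astype(float)
         A = np.triu(A, 1); A = A + A.T      # sparse, symmetric, zero diagonal
         A /= np.linalg.norm(A, 2)           # scale the spectrum into [-1, 1]

         # Chebyshev coefficients of exp(x) on [-1, 1]
         c = np.array([iv(0, 1.0)] + [2.0 * iv(m, 1.0) for m in range(1, Nc)])

         R = np.zeros((N, N))
         for _ in range(Ns // Nb):
             V = rng.choice([-1.0, 1.0], size=(N, Nb))   # Rademacher test vectors
             M0, M1 = V, A @ V                           # T_0(A) V and T_1(A) V
             W = c[0] * V + c[1] * M1
             for m in range(2, Nc):
                 M0 = 2.0 * (A @ M1) - M0    # T_m(A) V = 2 A T_{m-1}(A) V - T_{m-2}(A) V
                 W += c[m] * M0
                 M0, M1 = M1, M0             # pointer swap
             R += W @ V.T                    # accumulate f(A) V V^T
         est = R / Ns

         # The diagonal (the subgraph centralities) is recovered approximately;
         # the error shrinks as Ns grows.
         print(np.max(np.abs(np.diag(est) - np.diag(expm(A)))))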

  8. Accelerated Stochastic Matrix-Function Estimator
     [Diagram: the SME pseudocode from slide 7, split between CPU and FPGA; blocks of test vectors V are sent to the device and result blocks W are returned.]

  9. Accelerated Stochastic Matrix-Function Estimator
     Map the entire outer loop onto the FPGA:
     • (Almost) no host-device communication
     • 3 sequential stages
     • No double buffering needed
     • 4 asynchronous kernels in the inner loop
     [Diagram: the full SME loop running on the FPGA, with the pseudocode from slide 7 annotated accordingly.]

  10. SME Architecture – Random Number Generator
     • xorshift64-based random number generator to generate a Rademacher distribution
     • High quality, passes many statistical tests [2]
     • Well suited for FPGA implementation
     • Initializes V, M0, and W on-the-fly

         // xorshift64* step: returns the new state (res.x) and the scrambled output (res.y)
         ulong2 xorshift64s(ulong x){
             ulong2 res;
             x ^= x >> 12;
             x ^= x << 25;
             x ^= x >> 27;
             res.x = x;
             res.y = x * 2685821657736338717ull;
             return res;
         }

         __kernel void rng(__global float *M0, __global float *W, __global float *V,
                           float cm, uint num, ulong seed){
             ulong2 rngs = {seed, 0xdecafbad};
             ulong rs;
             float rn;
             for(unsigned k = 0; k < num; k += N_UNROLL){
                 rngs = xorshift64s(rngs.x);
                 rs = rngs.y;
                 #pragma unroll N_UNROLL
                 for(unsigned b = 0; b < N_UNROLL; b++){
                     // one Rademacher (+1/-1) value per output bit
                     rn = ((rs >> b) & 0x1) ? -1.0 : 1.0;
                     V[k+b]  = rn;
                     M0[k+b] = rn;
                     W[k+b]  = cm * rn;   // W is pre-initialized as cm * V
                 }
             }
         }

     [Diagram: the RNG kernel (incl. RHS init) takes seed and cm and writes V, M0, and W.]
     [2] George Marsaglia. "Xorshift RNGs." Journal of Statistical Software, 2003.
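     For cross-checking the kernel on the host, the xorshift64* step can be reproduced in a few lines of Python (my own sketch; the seed, the number of values, and the 16-bits-per-output factor are arbitrary choices, since the slide leaves N_UNROLL unspecified).

         # Host-side reference sketch of the xorshift64* generator (my own code).
         # Python integers are unbounded, so every step is masked to 64 bits.
         MASK64 = (1 << 64) - 1

         def xorshift64s(x):
             x ^= x >> 12
             x ^= (x << 25) & MASK64
             x ^= x >> 27
             return x, (x * 2685821657736338717) & MASK64   # (new state, output)

         def rademacher_stream(seed, n, bits_per_output=16):
             """Yield n values in {-1.0, +1.0}, one per output bit, as the kernel does."""
             state = seed
             produced = 0
             while produced < n:
                 state, out = xorshift64s(state)
                 for b in range(min(bits_per_output, n - produced)):
                     yield -1.0 if (out >> b) & 1 else 1.0
                     produced += 1

         print(list(rademacher_stream(seed=42, n=8)))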

  11. SME Architecture – CSR Sparse Matrix Multiplication
     [Figure: a small sparse matrix stored in CSR format (value array A, column-index array JA, row-pointer array IA) and the corresponding sparse matrix-matrix multiplication.]
     [Diagram: CSR Reader, RHS Prefetcher (with M0 cache), SpMM (128-wide float4 SIMD), and AXPY kernels connected through FIFO channels (c_A, c_JA, c_e, c_S, c_rhs), producing W.]
     • Asynchronous kernels
     • Synchronization via FIFO channels
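     As a reference for what the SpMM kernel computes, here is a plain Python sketch of CSR storage and the sparse-times-dense-block product (my own illustration; the tiny example matrix is hypothetical, not the one drawn on the slide).

         # Sketch of CSR storage and SpMM (sparse CSR matrix times dense block).
         # vals[k] holds a nonzero in column JA[k]; row r owns the range IA[r]..IA[r+1].
         import numpy as np

         def spmm_csr(vals, JA, IA, B):
             """Return A @ B where A is given in CSR form and B is a dense block."""
             n_rows = len(IA) - 1
             C = np.zeros((n_rows, B.shape[1]))
             for r in range(n_rows):
                 for k in range(IA[r], IA[r + 1]):
                     C[r] += vals[k] * B[JA[k]]   # scale row JA[k] of B and accumulate
             return C

         # Tiny hypothetical example: a 3x3 matrix with 6 nonzeros
         vals = np.array([6.0, 8.0, 1.0, 5.0, 7.0, 2.0])
         JA   = np.array([0, 2, 1, 0, 2, 1])
         IA   = np.array([0, 2, 3, 6])
         B    = np.random.default_rng(1).standard_normal((3, 4))

         A_dense = np.zeros((3, 3))
         for r in range(3):
             for k in range(IA[r], IA[r + 1]):
                 A_dense[r, JA[k]] = vals[k]
         assert np.allclose(spmm_csr(vals, JA, IA, B), A_dense @ B)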

  12. Resource Utilization for Kernels on Stratix-V 5SGXA7
     [Chart: per-kernel resource utilization in percent (LEs, FFs, RAMs, DSPs) for RNG, matrix_prefetch, rhs_prefetch, SpMM, AXPY, and accu_result, with the inner-loop kernels marked; y-axis 0 to 60%.]

  13. SME on Heterogeneous System
     POWER8 heterogeneous node:
     1. Dual-socket 6-core CPU, 96 threads
        • IBM xlC compiler using OpenMP and ATLAS BLAS
     2. NVIDIA Tesla K40 GPU
        • CUDA 7.5 with cuBLAS
        • Self-developed SpMM outperforms cusparseScsrmm()
     3. Nallatech PCIe-385 card with Altera Stratix-V FPGA
        • Altera OpenCL HLS

  14. SME – Approximation Quality on the 3 Platforms
     Estimation quality depends on several factors:
     • Number of test vectors
     • Number of terms in the Chebyshev expansion
     • Quality of the random number generator used to initialize the test vectors
     • Precision of the floating-point operations
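     A quick way to see the first factor in isolation (my own sketch, not a result from the talk): estimate diag(exp(A)) with the plain Rademacher estimator, using the exact exp(A) so that only the number of test vectors matters, and watch the error fall roughly like 1/sqrt(Ns).

         # Sketch: estimator error vs. number of test vectors Ns (illustration only).
         import numpy as np
         from scipy.linalg import expm

         rng = np.random.default_rng(0)
         N = 100
         A = (rng.random((N, N)) < 0.05).astype(float)
         A = np.triu(A, 1); A = A + A.T
         A /= np.linalg.norm(A, 2)

         fA = expm(A)                 # exact f(A), feasible only for tiny N
         exact = np.diag(fA)
         for Ns in (16, 64, 256, 1024):
             V = rng.choice([-1.0, 1.0], size=(N, Ns))   # Rademacher test vectors
             est = np.mean((fA @ V) * V, axis=1)         # diagonal estimator
             print(Ns, float(np.max(np.abs(est - exact))))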

  15. Power Profiling
     • POWER8 On-Chip Controller (OCC)
       • Enables fast, scalable monitoring (ns timescale)
       • Implemented in a PowerPC 405 running a continuous, real-time OS
       • Monitors workload activity, chip temperature, and current
     • Trace power consumption using Amester
       • Tool for out-of-band monitoring of POWER8 servers
       • Open sourced on GitHub: github.com/open-power/amester
       • Current sensors for various domains (socket, memory buffer/DIMM, GPU, PCIe, fan, …)
     • Compute power consumption: P_comp = P_total − P_idle
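     A small sketch (mine, not part of Amester) of the arithmetic behind the last bullet: subtract the idle power from a sampled total-power trace and integrate over the run to obtain the compute energy.

         # Sketch: compute energy from a sampled power trace, P_comp = P_total - P_idle.
         import numpy as np

         def compute_energy(p_total_watts, dt_seconds, p_idle_watts):
             """Integrate dynamic power (total minus idle) over the trace, in joules."""
             p_comp = np.asarray(p_total_watts) - p_idle_watts
             return float(np.sum(p_comp) * dt_seconds)

         # Hypothetical 20 s trace sampled at 1 kHz, ~550 W idle (cf. the conclusion slide)
         t = np.linspace(0.0, 20.0, 20000)
         trace = 550.0 + 10.0 * np.abs(np.sin(t))
         print(compute_energy(trace, dt_seconds=1e-3, p_idle_watts=550.0), "J")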

  16. Application-Level Power Traces
     [Figure: power traces over the application run for CPU (6 threads), CPU (1 thread), GPU, and FPGA; a device reconfiguration phase is marked.]

  17. SME – Energy-Efficiency Analysis

     Platform          Run time [s]   Dynamic power [W]   Energy to solution [kJ]   Note
     CPU (6 threads)   172.55         143.92              24.83                     fastest CPU version
     CPU (1 thread)    232.31         57.01               13.24                     most efficient CPU version
     GPU               19.52          155.42              3.03
     FPGA              114.00         9.13                1.04

     The FPGA is ~6x slower but ~3x more energy-efficient compared to the GPU.
     Hardware: CPU = IBM POWER8, 2-socket 12-core; GPU = NVIDIA K40; FPGA = Nallatech PCIe-385 with Altera Stratix-V.
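     As a sanity check (my own arithmetic, not on the slide), the energy-to-solution column is just dynamic power times run time, and the FPGA-versus-GPU ratios quoted above follow from the same numbers:

         # Sanity check of the table: energy = dynamic power * run time, plus the
         # FPGA-vs-GPU ratios (~6x slower, ~3x more energy-efficient).
         rows = {                    # platform: (run time [s], dynamic power [W])
             "CPU (6 threads)": (172.55, 143.92),
             "CPU (1 thread)":  (232.31, 57.01),
             "GPU":             (19.52, 155.42),
             "FPGA":            (114.00, 9.13),
         }
         for name, (t, p) in rows.items():
             print(f"{name:16s} energy to solution = {t * p / 1000:.2f} kJ")

         slowdown = 114.00 / 19.52                              # ~5.8x
         efficiency_gain = (19.52 * 155.42) / (114.00 * 9.13)   # ~2.9x
         print(f"FPGA vs GPU: {slowdown:.1f}x slower, {efficiency_gain:.1f}x less energy")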

  18. Conclusion
     • Accelerators outperform the CPU; the GPU is dominant in terms of absolute performance
       • The GPU is 12x and the FPGA 2x faster than a CPU core
     • The compute energy of the FPGA is outstanding
       • 3x better than the GPU, 13x better than the CPU
     • What about the idle power? (~550 W for the system we used)
       • We need energy-proportional computing
       • Cloud: accelerators free CPU cycles
       • Cloud-FPGA: standalone, network-attached FPGA to remove the "host overhead"
     • OpenCL increased productivity
       • Short design time, (almost) no verification
       • Optimization is cumbersome
     [Chart: relative performance of the platforms.]

  19. Questions?
     Heiner Giefers, IBM Research – Zurich, hgi@zurich.ibm.com
     26th International Conference on Field-Programmable Logic and Applications
     29th August – 2nd September 2016, SwissTech Convention Centre, Lausanne, Switzerland
