

SLIDE 1

Energy-Efficient Stochastic Matrix Function Estimator for Graph Analytics on FPGA

Heiner Giefers, Peter Staar, Raphael Polig IBM Research – Zurich

26th International Conference on Field-Programmable Logic and Applications, 29th August – 2nd September 2016, SwissTech Convention Centre, Lausanne, Switzerland

SLIDE 2

Motivation

  • Knowledge graphs appear in many areas of basic research
  • These knowledge graphs can become very big (e.g., covering around 80M papers and 10M patents)
  • We want to extract hidden correlations in these graphs

8/31/2016 2

[Figure: System-Biology Knowledge Graph linking Journals (9052), Authors (1869746), Pubmed (644890), Diseases (9100), Drugs (8148), Symptoms (1433), MeSH (35158), Proteins (549832)]

SLIDE 3
Graph Analytics Use Cases

To extract hidden correlations in these graphs, we need to apply advanced graph algorithms. Examples are:

  • 1. Subgraph centralities: find the most relevant nodes by ranking them according to the number of closed walks
  • 2. Spectral methods: compare large graphs by looking at their spectrum

SLIDE 4
Graph Analytics Use Cases

To extract hidden correlations in these graphs, we need to apply advanced graph algorithms. Examples are:

  • 1. Subgraph centralities: find the most relevant nodes by ranking them according to the number of closed walks
  • 2. Spectral methods: compare large graphs by looking at their spectrum

Both require us to diagonalize the adjacency matrix of the graph. This has a complexity of O(N³). A graph of 1M nodes requires exascale computing.

SLIDE 5

Node Centrality for Ranking Nodes in a Graph

  • Subgraph centrality: total number of closed walks in the network
  • The number of walks of length m in A from v to w is [A^m]_vw
  • Subgraph centrality considers all possible walks; shorter walks have higher importance: I + A + A²/2! + A³/3! + A⁴/4! + A⁡/5! + ⋯
  • Taylor series for the exponential function β†’ f(A) = e^A is a weighted sum of all paths in A
  • Consider only closed walks β†’ the centrality of node i is c_i = [e^A]_ii
  • Explicit computation of matrix exponentials is difficult
  • Though A is sparse, A^m becomes dense β†’ huge memory footprint
  • Exascale compute requirements for exact solutions


SLIDE 6

Observations

  • Observation 1: We only need an approximate solution
  • We do not need highly accurate results to obtain a good ranking!
  • We do not need to know the exact values of the eigenvalues in order to obtain a histogram of the spectrum of A!
  • Observation 2: In both operations, we need to compute a subset of the elements of a matrix functional
  • In the case of the subgraph centrality, we need the diagonal of e^A
  • In the case of the spectrogram, we need to compute the trace of multiple step functions


SLIDE 7

Stochastic Matrix-Function Estimator (SME)

Framework to approximate (a subset of elements of) the matrix f(A), where f is an arbitrary function and A is the adjacency matrix of the graph [1].

  • Use Ns test vectors in blocks of size Nb
  • Initialize the Nb columns of V with random -1/+1 entries (2% of run time)
  • Compute W = f(A) V with Chebyshev polynomials of the first kind (97% of run time)
  • Accumulate partial results over the test vectors (1% of run time)
  • Normalize to get the final result

    R = zero();
    for l = 1 to Ns/Nb do
        forall e in V do
            e = (rand()/RAND_MAX < 0.5) ? -1.0 : 1.0;
        done
        M0 = V
        W  = c[0] * V          // AXPY
        M1 = A * V             // SPMM
        W  = c[1] * M1 + W     // AXPY
        for m = 2 to Nc do
            M0 = 2 * A * M1 - M0   // SPMM
            W  = c[m] * M0 + W     // AXPY
            pointer_swap(M0, M1)
        done
        R += W * V^T           // SGEMM / DOT
    done
    E[f(A)] = R / Ns

[1] Peter W. J. Staar, Panagiotis Kl. Barkoutsos, Roxana Istrate, A. Cristiano I. Malossi, Ivano Tavernelli, Nikolaj Moll, Heiner Giefers, Christoph Hagleitner, Costas Bekas, and Alessandro Curioni. β€œStochastic Matrix-Function Estimators: Scalable Big-Data Kernels with High Performance.” IPDPS 2016 (Best Paper Award).


SLIDE 8

Accelerated Stochastic Matrix-Function Estimator

(SME pseudocode repeated from the previous slide.)

[Figure: CPU/FPGA partitioning of the pseudocode; the V and W buffers are exchanged between host and device]


SLIDE 9

Accelerated Stochastic Matrix-Function Estimator

(SME pseudocode repeated from the previous slide.)

[Figure: the complete SME loop mapped onto the FPGA]

Map the entire outer loop onto the FPGA:

  • (Almost) no host-device communication
  • 3 sequential stages
  • No double buffering needed
  • 4 asynchronous kernels in the inner loop

SLIDE 10

SME Architecture – Random Number Generator

  • xorshift64-based random number generator to generate a Rademacher distribution
  • High quality: passes many statistical tests [2]
  • Well suited for FPGA implementation
  • Initialize V, M0, and W on-the-fly

    ulong2 xorshift64s(ulong x) {
        ulong2 res;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        res.x = x;
        res.y = x * 2685821657736338717ull;
        return res;
    }

    __kernel void rng(__global float *M0, __global float *W,
                      __global float *V, float cm,
                      uint num, ulong seed) {
        ulong2 rngs = {seed, 0xdecafbad};
        ulong rs;
        float rn;
        for (unsigned k = 0; k < num; k += N_UNROLL) {
            rngs = xorshift64s(rngs.x);
            rs = rngs.y;
            #pragma unroll N_UNROLL
            for (unsigned b = 0; b < N_UNROLL; b++) {
                rn = ((rs >> b) & 0x1) ? -1.0f : 1.0f;
                V[k+b]  = rn;
                M0[k+b] = rn;
                W[k+b]  = cm * rn;
            }
        }
    }

[2] George Marsaglia. β€œXorshift RNGs.” Journal of Statistical Software, 2003.

[Figure: RNG kernel block (incl. RHS init) taking cm0 and seed as inputs and writing V, M0, and W]

SLIDE 11

SME Architecture: CSR Sparse Matrix Multiplication

  • Asynchronous kernels
  • Synchronization via FIFO channels

[Figure: kernel pipeline for the sparse matrix-matrix multiplication. A CSR Reader streams the IA, JA, and A arrays and an RHS Prefetcher (with a cache) streams M0; FIFO channels c_JA, c_A, c_e, c_rhs, and c_S connect them to a 128-wide SIMD SpMM kernel (float16/float4 data paths), which feeds the AXPY kernel updating W. The accompanying example shows a sparse matrix in CSR format with value array A, column-index array JA, and row-pointer array IA = 0 3 5 8 11 14 17 20 22]

SLIDE 12

Resource Utilization for Kernels on Stratix-V 5SGXA7

[Chart: LEs, FFs, RAMs, and DSPs used by the kernels RNG, matrix_prefetch, rhs_prefetch, SpMM, AXPY, and accu_result; SpMM and AXPY form the inner loop]

SLIDE 13

SME on Heterogeneous System

POWER8 heterogeneous node

  • 1. Dual-socket 6-core CPU, 96 threads
  • IBM xlC compiler using OpenMP and ATLAS BLAS
  • 2. NVIDIA Tesla K40 GPU
  • CUDA 7.5 with cuBLAS
  • Self-developed SpMM outperforms cusparseScsrmm()
  • 3. Nallatech PCIe-385 card w/ Altera Stratix-V FPGA
  • Altera OpenCL HLS


SLIDE 14

SME – Approximation Quality on the 3 Platforms

  • Estimation quality depends on several factors:
  • Number of test vectors
  • Number of terms in the Chebyshev expansion
  • Quality of the random number generator used to initialize the test vectors
  • Precision of floating-point operations


SLIDE 15


Power Profiling

  • POWER8 On-Chip Controller (OCC)
  • Enables fast, scalable monitoring (ns timescale)
  • OCC is implemented in a PowerPC 405
  • Runs a continuously operating real-time OS
  • Monitors workload activity, chip temperature, and current
  • Trace power consumption using Amester
  • Tool for out-of-band monitoring of POWER8 servers
  • Open-sourced on GitHub: github.com/open-power/amester
  • Current sensors for various domains (socket, memory buffer/DIMM, GPU, PCIe, fan, …)
  • Compute power consumption: P_comp = P_total βˆ’ P_idle
SLIDE 16

Application-Level Power Traces

[Figure: application-level power traces for CPU (1 thread), CPU (6 threads), GPU, and FPGA; the FPGA trace includes a device-reconfiguration phase]

SLIDE 17

SME – Energy-Efficiency Analysis

    Platform           Run time [s]   Dynamic power [W]   Energy to solution [kJ]
    CPU (6 threads)    172.55         143.92              24.83
    CPU (1 thread)     232.31          57.01              13.24
    GPU                 19.52         155.42               3.03
    FPGA               114.00           9.13               1.04

CPU: IBM POWER8 2-socket 12-core. GPU: NVIDIA K40. FPGA: Nallatech PCIe-385 with Altera Stratix-V. The fastest CPU version uses 6 threads; the most energy-efficient CPU version uses 1 thread.

The FPGA is ~6x slower but ~3x more energy-efficient compared to the GPU.


SLIDE 18

Conclusion

  • Accelerators outperform the CPU; GPUs are dominant in terms of absolute performance
  • The GPU is 12x, the FPGA 2x faster than a CPU core
  • The compute energy for the FPGA is outstanding
  • 3x better compared to the GPU, 13x better compared to the CPU
  • What about the idle power? (~550W for the system we used)
  • We need energy-proportional computing
  • Cloud: Accelerators free CPU cycles
  • Cloud-FPGA: Standalone, network-attached FPGAs remove the β€œhost overhead”
  • OpenCL increased productivity
  • Short design time, (almost) no verification
  • Optimization is cumbersome


[Chart: relative performance of CPU, GPU, and FPGA]

SLIDE 19

Questions?

Heiner Giefers IBM Research – Zurich hgi@zurich.ibm.com

26th International Conference on Field-Programmable Logic and Applications, 29th August – 2nd September 2016, SwissTech Convention Centre, Lausanne, Switzerland