On Static Timing Analysis of GPU Kernels
Vesa Hirvisalo
Department of Computer Science and Engineering, Aalto University
14th International Workshop on Worst-Case Execution Time Analysis
Madrid, Spain, 8 July 2014
Talk outline
Introduction to SIMT executed kernels
◮ Co-operating Thread Arrays (CTA)
◮ warp scheduling
◮ thread divergence
Static WCET estimation
◮ divergence analysis
◮ abstract warp creation
◮ abstract CTA simulation
An example
◮ based on a simple machine model
Introduction
Data parallel programming and accelerators
◮ we try to maximize occupancy of the hardware
GPGPU computing as an example
◮ heterogeneous computing
◮ we concentrate on the accelerator (GPU) side timing
◮ hardware scheduling is essential
Launches
◮ Co-operating Thread Arrays (CTA)
◮ the computation is prepared on the host (CPU) side
◮ input data and a number of threads
◮ these are launched to the accelerator (GPU); a host-side sketch follows this list
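To make the launch concrete, here is a minimal host-side sketch in OpenCL C. It is an illustration only: the talk's kernel language merely resembles OpenCL, and the function launch_triangle_sum, its parameters, and the omitted platform, device, and program setup are assumptions of this sketch.

    /* Minimal host-side launch sketch (assumed OpenCL C; error checking
       and platform/device/program setup are omitted for brevity). */
    #include <CL/cl.h>

    void launch_triangle_sum(cl_command_queue queue, cl_kernel kernel,
                             cl_mem m, cl_mem v, cl_int c,
                             size_t num_threads)
    {
        /* Bind the input data prepared on the host (CPU) side. */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &m);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &v);
        clSetKernelArg(kernel, 2, sizeof(cl_int), &c);

        /* Launch num_threads threads to the accelerator (GPU); the runtime
           groups them into work groups (CTAs), which the hardware schedules. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &num_threads, NULL, 0, NULL, NULL);
        clFinish(queue);
    }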
Example (1/2): a kernel
Consider the following code in a language resembling OpenCL (note the use of the thread identifier Tid):
__kernel TriangleSum(float* m, float* v, int c) {
    int d = 0;              /* each thread has its own variables */
    float s = 0;            /* s is the sum to be collected */
    int L = (Tid + 1) * c;
    for (int i = Tid; i < L; i += c) {
        if (d % (Tid + 1) == 0) s += 1;
        if (d % 2) s += m[i];
        __syncthreads();    /* assuming compiler support */
        d += 1;
    }
    v[d-1] = s;
}
SIMT execution
Threads are processed by computing units (CU)
◮ in the following we assume a single CU
◮ able to handle a single work group (set of threads)
The threads are executed in warps
◮ warp width equals the number of cores
◮ the warp has a PC, which applies to all its unmasked threads
◮ SIMT = Single Instruction Multiple Threads
◮ there are typically several warps
◮ the warp scheduler makes the choice (see the scheduler sketch below)
◮ round-robin is typical
◮ the warp must be ready
◮ if none is ready, the execution stalls
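As a toy illustration of the scheduling rule, the following C sketch (entirely assumed: warp count, kernel length, and latencies are made up) issues one instruction per cycle from the next ready warp in round-robin order and counts a stall cycle whenever no warp is ready.

    #include <stdbool.h>
    #include <stdio.h>

    #define NWARPS 4      /* assumed number of resident warps    */
    #define KERNEL_LEN 8  /* assumed instructions per warp (toy) */

    typedef struct {
        int pc;        /* the warp's single PC, shared by all its threads */
        int ready_at;  /* the warp is ready once cycle >= ready_at        */
    } Warp;

    int main(void) {
        /* Two warps start blocked on a (toy) memory latency. */
        Warp w[NWARPS] = { {0, 0}, {0, 3}, {0, 6}, {0, 0} };
        int cycle = 0, rr = 0, retired = 0, stalls = 0;

        while (retired < NWARPS) {
            bool issued = false;
            /* Round-robin: scan for the next ready warp. */
            for (int i = 0; i < NWARPS && !issued; i++) {
                Warp *c = &w[(rr + i) % NWARPS];
                if (c->pc < KERNEL_LEN && c->ready_at <= cycle) {
                    c->pc++;                        /* issue one instruction */
                    if (c->pc == KERNEL_LEN)
                        retired++;
                    rr = (rr + i + 1) % NWARPS;     /* advance the round     */
                    issued = true;
                }
            }
            if (!issued)
                stalls++;                           /* no warp ready: stall  */
            cycle++;
        }
        printf("total cycles = %d, stalls = %d\n", cycle, stalls);
        return 0;
    }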
Small analysis windows mean few paths
[Figure: warps shown progressing time-wise and code-wise; a stall shifts warps relative to each other, letting some make faster progress, and the shift gives us an analysis window.]
Divergence in execution
[Figure: SIMT divergence handled with a reconvergence stack over time. Program flow A → {B, C} → D with active-thread masks A = 11111111, B = 11000011, C = 00111100, D = 11111111. Stack entries hold (R-pc, Mask, Next-pc). Initial stack contents: (−, 11111111, A). After divergence: (−, 11111111, D) with (D, 00111100, C) and (D, 11000011, B) pushed on top. After branch completion: (−, 11111111, D) and (D, 00111100, C). After reconvergence: (−, 11111111, D).]
[Figure: divergent timing; the threads of warp1–warp4 proceed over the matrix at different rates.]
WCET estimation
We define the total time spent in execution as
Texec = Tinstr + Tstall
Considering (structured) branching we have

Tif_else = Ttrue_branch                   if all threads converge to true
Tif_else = Tfalse_branch                  if all threads converge to false
Tif_else = Tfalse_branch + Ttrue_branch   if the threads diverge
The warp scheduling hides the memory latencies. In the worst case we have
Tstall = max(0, Tmemory − Nwarps)
For loops, we use the time of the longest thread in the warp.
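The rules above can be read as a small cost calculus. The following C sketch (my own illustration with made-up cycle counts) encodes the branch rule and the worst-case stall formula and combines them as Texec = Tinstr + Tstall.

    #include <stdio.h>

    /* Branch cost: when a warp diverges, both sides execute serially. */
    int t_if_else(int t_true, int t_false, int some_true, int some_false) {
        if (some_true && !some_false) return t_true;   /* converge to true  */
        if (!some_true && some_false) return t_false;  /* converge to false */
        return t_false + t_true;                       /* divergent warp    */
    }

    /* Worst-case stall per Tstall = max(0, Tmemory - Nwarps): the other
       warps' issue slots cover part of the memory latency. */
    int t_stall(int t_memory, int n_warps) {
        int uncovered = t_memory - n_warps;
        return uncovered > 0 ? uncovered : 0;
    }

    int main(void) {
        int t_instr = t_if_else(7, 5, 1, 1);   /* divergent branch: 12 */
        int t_exec  = t_instr + t_stall(20, 4);
        printf("Texec = %d\n", t_exec);        /* 12 + 16 = 28 */
        return 0;
    }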
Static divergence analysis
We base our static divergence analysis on GSA (Gated Single Assignment). Instead of the φ-function of SSA, which it resembles, GSA uses three special functions: γ, µ, and η:
◮ the γ function is a join for branches: γ(p, v1, v2) is v1 if p is true, and v2 otherwise
◮ the µ function is a join for loop headers: µ(v1, v2) is v1 for the first iteration and v2 otherwise
◮ the η function is the loop exit function: η(p, v) binds a loop-dependent value v to the loop predicate p
We say that a definition of a variable is divergent if its value depends on the thread.
◮ if there are no divergent definitions for a branch predicate, we know the branch to be non-divergent (a propagation sketch follows this list)
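To illustrate the divergence classification, here is a minimal fixed-point propagation sketch in C (my own illustration, not the paper's analyzer): it hand-encodes the def-use dependences of the TriangleSum kernel and marks every definition that transitively depends on Tid.

    #include <stdbool.h>
    #include <stdio.h>

    #define NVARS 6
    enum { TID, C, L, I, D, S };          /* variables of the TriangleSum kernel */
    static const char *name[NVARS] = { "Tid", "c", "L", "i", "d", "s" };

    /* dep[v][u] holds when the definition of v uses u (hand-encoded here;
       a compiler would read this off the GSA form). */
    static bool dep[NVARS][NVARS] = {
        [L] = { [TID] = true, [C] = true },  /* L = (Tid + 1) * c            */
        [I] = { [TID] = true, [C] = true },  /* i = Tid; i += c              */
        [S] = { [D] = true, [TID] = true },  /* s guarded by a Tid predicate */
    };                                       /* d = 0; d += 1: uniform       */

    int main(void) {
        bool divergent[NVARS] = { [TID] = true };  /* seed: the thread id */
        bool changed = true;
        while (changed) {                          /* fixed-point iteration */
            changed = false;
            for (int v = 0; v < NVARS; v++)
                for (int u = 0; u < NVARS; u++)
                    if (!divergent[v] && dep[v][u] && divergent[u]) {
                        divergent[v] = true;
                        changed = true;
                    }
        }
        for (int v = 0; v < NVARS; v++)
            printf("%-3s is %s\n", name[v], divergent[v] ? "divergent" : "uniform");
        return 0;
    }

On this encoding d stays uniform, so a predicate such as d % 2 is non-divergent, while d % (Tid + 1) == 0 depends on Tid and is divergent.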
Abstract warp construction
An abstract warp A = (V, E) is a directed graph. The nodes V have three node types:
◮ time nodes describe code regions with two values: Tinstr is the upper bound of the instruction execution time consumed, and Tshift is the upper bound of the variation in instruction execution time caused by thread divergence
◮ memory access nodes mark places where memory access stalls may happen
◮ barrier nodes mark places where barrier synchronization must happen
An abstract warp is constructed from the code in a recursive bottom-up way (a data-structure sketch follows).
Example (2/2): CTA simulation
Assuming a simple machine model (1 instr/cycle), we get the following abstract warp
[Figure: the abstract warp as a chain of nodes; the legible time-node labels (Tinstr, Tshift) are (7, 0), (0, 2), (7, 1), (0, 3), (0, 4), and (0, 4).]
The abstract CTA simulation
◮ begins from the leftmost node
◮ assuming warp width = 4, we have 4 warps
A final estimate TWCET = 804
◮ a cycle-accurate simulator gives 688 cycles
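The following C sketch shows one way the abstract CTA simulation could traverse the graph, using the time-node labels above. It is my own simplified reading: it serializes all warps on one CU and charges each node's Tinstr plus Tshift, so it will not reproduce the slide's TWCET = 804; the real simulation also models memory stalls and barrier waits.

    #include <stdio.h>

    #define NWARPS 4
    #define NNODES 6

    typedef struct { int t_instr, t_shift; } TimeNode;

    int main(void) {
        /* Time-node labels (Tinstr, Tshift) read from the example figure. */
        TimeNode node[NNODES] = { {7,0}, {0,2}, {7,1}, {0,3}, {0,4}, {0,4} };

        /* Each warp walks the chain from the leftmost node; in the worst
           case it spends Tinstr plus the divergence-induced Tshift there. */
        int bound = 0;
        for (int w = 0; w < NWARPS; w++)
            for (int n = 0; n < NNODES; n++)
                bound += node[n].t_instr + node[n].t_shift;

        printf("crude serialized bound = %d cycles\n", bound);
        return 0;
    }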
Conclusions
Static WCET estimation
◮ divergence analysis
◮ abstract warp creation
◮ abstract CTA simulation
We allow some divergence
◮ understanding divergence is essential
◮ uniform (non-divergent) execution is simpler
We demonstrated an approach
◮ we used a simple machine model
◮ modeling real hardware is complex
◮ however, GPUs are rather predictable
◮ they are designed for real-time (i.e., graphics)