UNIVERSITY OF A CORUÑA
Streaming-Oriented Parallelization of Domain-Independent Irregular Kernels
UnConventional High Performance Computing 2010 (UCHPC)
Computer Architecture Group, University of A Coruña, Spain
Jacobo Lobeiras Blanco, Margarita Amor López, Manuel Carlos Arenaz Silva, Basilio B. Fraguela Rodríguez

Presentation Overview
1. Motivation
- GPU programming
- Computational Kernel Analysis
- Brook+ language
2. Computational kernel parallelization
- Assignment
- Reduction
3. Performance analysis
4. Conclusions
University of A Coruña, Spain 1 / 25 Streaming-Oriented Parallelization of Domain-Independent Irregular Kernels

GPU programming
• GPU performance is evolving quickly. Compared to CPUs, however, GPUs still have many architectural restrictions, and their programming model is more complex, requiring special languages and tools:
1. HLSL
2. Cg
3. CUDA
4. Parallel Nsight
5. BrookGPU
6. Brook+
7. AMD Stream Profiler
8. OpenCL

GPU programming
CPU
• General-purpose processors. Quad-core models are widespread, and their typical computing power is about 50 GFlops.
• Easily programmable using standard languages like C++ or Java, with parallel standards like OpenMP and advanced debugging tools.
GPU
• Graphics-oriented processors capable of running thousands of simultaneous threads, with TFlop-level computing power.
• Low-level programming is complex and hardware dependent.
• High-level languages like CUDA or Brook+ are proprietary, and directive-based proposals are still experimental.
• Recent efforts have led to the creation of OpenCL, a standard for heterogeneous computing.

GPU programming
CPU (45 nm Intel Core 2 Penryn core, 107 mm²)
• High clock speed and out-of-order execution, optimized for sequential performance.
• Memory architecture designed for low-latency access.
• Complex processing cores and a large chip area devoted to cache.
GPU (55 nm ATI Radeon 4850 RV770 core, 260 mm²)
• High-bandwidth, high-throughput memory architecture.
• Small cache, thanks to memory latency hiding techniques like multithreading.
• Most of the chip area is devoted to hundreds of small processing units.

GPU programming
Radeon 4850 architecture diagram
- 10 SIMD modules
- 16 SPs per SIMD module, as well as a texture unit and 16 KB of shared memory
- Each SP has 5 processing elements, but only the T unit supports FP64
- Four 64-bit memory channels

GPU programming
• Parallelizing applications on GPUs is a complex task that usually requires specially designed algorithms to exploit their advantage in computational power. A straightforward port tends to provide little performance benefit. The main obstacles are:
- Low sequential and divergent code performance
- Limited shared memory
- PCI-E connection bandwidth and latency
- Coalescence issues

Motivation
• There are several well-known general parallelization strategies for CPUs that can be applied to many problems. Studying how they can be adapted to GPUs is of great interest, since it can reduce development time and effort.
[Figure: examples of parallel for, parallel reduction, parallel scan and parallel recursion]
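As an illustration of three of the strategies named above, the following Python sketch emulates a parallel for, a tree-shaped parallel reduction and a parallel scan. It is not taken from the presentation: the function names are hypothetical, the scan uses the Hillis-Steele scheme as one possible choice, and the "parallel" steps are emulated sequentially (every element produced within one round is independent of the others in that round).

```python
from operator import add


def parallel_for(f, xs):
    # Each output element depends only on its own input element,
    # so all iterations could run at the same time.
    return [f(x) for x in xs]


def tree_reduce(op, xs):
    # Pairwise combination in about log2(n) rounds; the pairs inside one
    # round are independent. Assumes a non-empty input and an associative op.
    level = list(xs)
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element is carried to the next round
        level = nxt
    return level[0]


def inclusive_scan(op, xs):
    # Hillis-Steele scan: in each round every element combines with the
    # element `step` positions to its left; `step` doubles every round.
    out = list(xs)
    step = 1
    while step < len(out):
        out = [out[i] if i < step else op(out[i - step], out[i])
               for i in range(len(out))]
        step *= 2
    return out


squares = parallel_for(lambda x: x * x, [0, 1, 2, 3])
total = tree_reduce(add, [1, 2, 3, 4, 5, 6, 7, 8])
prefix = inclusive_scan(add, [1, 2, 3, 4])
```

The reduction and the scan trade extra rounds for independence inside each round, which is the property a GPU needs to keep many threads busy.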

Computational Kernel Analysis
• In this work, a computational kernel is a regular code pattern frequently used in programs. We use domain-independent, concept-level computational kernels, which enable code patterns to be recognized independently of the programming language.
• Computational kernel analysis enables the identification of potential code parallelism without in-depth knowledge of the underlying algorithms.
• Computational kernels can be classified into several families, depending on their characteristics and memory access patterns:
a) Assignments
b) Inductions
c) Reductions
d) Masks
e) Recurrences
f) Reinitializations

Brook+ language
Brook+ is based on C, but follows a SIMD streaming paradigm.
• Data resides in array-like structures called streams.
• In each invocation, a kernel is executed over the whole domain in parallel, creating a thread for each element of the output.
• By default, Brook+ uses cached memory reads, so coalescence is not an issue. However, each thread can only write to a certain location of the output stream.
• The language also supports reductions to perform collective operations, like finding the maximum element of a set.
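The execution model described above can be emulated in a few lines of Python. This is only a sketch of the idea and not Brook+ syntax: `run_kernel` and `gather_kernel` are hypothetical names, streams are modeled as plain lists, and the per-element threads are emulated by a sequential loop. The point it tries to show is that a kernel instance may read any position of its input streams but produces exactly one value, for its own output position.

```python
from functools import reduce


def run_kernel(kernel, domain_size, *inputs):
    # One logical thread per output element: instance i computes the value
    # of output position i and cannot write anywhere else.
    return [kernel(i, *inputs) for i in range(domain_size)]


def gather_kernel(i, index, values):
    # Reads are unrestricted (gather): any position of `values` may be read.
    return values[index[i]]


index_stream = [3, 0, 0, 2]
value_stream = [10.0, 20.0, 30.0, 40.0]
output_stream = run_kernel(gather_kernel, 4, index_stream, value_stream)

# A collective operation in the spirit of a reduction: the maximum element.
maximum = reduce(max, output_stream)
```

Because writes are fixed to the thread's own output position, a loop that scatters values through an index array cannot be expressed directly in this model, which is the difficulty the irregular assignment kernel has to work around.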

Assignment
Assignment is the simplest kernel: it stores a value in the specified memory address.
1. If the value is written to a scalar variable, like the evaluation of an expression or the solution of an equation, it is called a scalar assignment.
2. If the memory is accessed through an indexed variable, like an array, and the access pattern can be expressed as a linear, polynomial or geometric function, it is called a regular assignment.
3. If the memory is accessed through an indexed variable, but the access pattern is irregular or unknown at compile time, it is called an irregular assignment.
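The three cases can be told apart by looking at the left-hand side of the assignment. The Python sketch below is an illustration written for this purpose, not code from the presentation; the function names, the expression used in the scalar case and the linear pattern `2 * i` are all assumptions.

```python
def scalar_assignment(b, c):
    # Scalar assignment: the destination is a plain variable.
    x = b * c + 1.0
    return x


def regular_assignment(src, size):
    # Regular assignment: the destination index is a known function of the
    # loop variable (here the linear pattern 2 * i).
    dst = [0.0] * size
    for i in range(len(src)):
        dst[2 * i] = src[i]
    return dst


def irregular_assignment(src, f, size):
    # Irregular assignment: the destination index comes from an indirection
    # array f whose contents are only known at run time. Two iterations may
    # write to the same position, and the last one must win.
    dst = [0.0] * size
    for i in range(len(src)):
        dst[f[i]] = src[i]
    return dst


scalar = scalar_assignment(2.0, 3.0)
regular = regular_assignment([1.0, 2.0, 3.0], 6)
irregular = irregular_assignment([1.0, 2.0, 3.0, 4.0], [2, 0, 2, 1], 4)
```

In the irregular case, iterations 0 and 2 both target position 2, so running the loop iterations in parallel without further analysis could leave either value in the output.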

Irregular assignment
The proposed solution to keep the equivalence is based on an inspector-executor strategy:
1. An inspector function analyzes the access pattern of the indirections. This analysis is normally computed by a single processor.
2. An executor function uses the vector generated by the inspector to distribute the iterations among the processors, ensuring that there will be no write conflicts.
Another advantage of this technique is that it enables dynamic dead code elimination at run time.
[Figure: inspector (indirection analysis) and executor (parallel processing on processors P1 and P2), with the following example values]
indirection: 1 2 3 3 6 5
inspector: 1 2 4 0 6 5
source: 1.0 2.0 3.0 4.0 5.0 6.0
target: 1.0 2.0 3.0 0.0 6.0 5.0
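One way to read the two steps is sketched below in Python. This is an interpretation under stated assumptions rather than the authors' implementation: the inspector is assumed to record, for every output position, the last iteration that writes to it; indices are 0-based and `-1` marks a position that is never written (the example on the slide is 1-based and uses 0 for that case); the names `inspector`, `executor` and `sequential_reference` are hypothetical.

```python
def inspector(f, size):
    # Sequential pass over the indirection array: for each output position,
    # remember the last iteration that writes to it (-1 means "never written").
    last_writer = [-1] * size
    for i, pos in enumerate(f):
        last_writer[pos] = i
    return last_writer


def executor(src, last_writer, dst):
    # Each output position j is computed independently, so there are no
    # write conflicts and one thread per output element is enough.
    return [src[w] if w >= 0 else dst[j] for j, w in enumerate(last_writer)]


def sequential_reference(src, f, dst):
    # Original loop, used only to check that the result is equivalent.
    out = list(dst)
    for i, pos in enumerate(f):
        out[pos] = src[i]
    return out


f = [2, 0, 2, 4, 0]
src = [10.0, 20.0, 30.0, 40.0, 50.0]
initial = [0.0] * 5

plan = inspector(f, 5)
result = executor(src, plan, initial)
```

In this example only iterations 2, 3 and 4 appear in `plan`, so the values produced by iterations 0 and 1 are never read. That is the dead code elimination effect mentioned above: overwritten iterations are simply skipped.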
