Programming Models and Runtime Systems for Heterogeneous Architectures


SLIDE 1

Programming Models and Runtime Systems for Heterogeneous Architectures

Sylvain Henry

sylvain.henry@inria.fr

Advisors: Denis Barthou and Alexandre Denis

November 14, 2013


SLIDE 2

High-Performance Computing

Sources: Dassault aviation, BMW, Larousse, Interstices

SLIDES 3-5

Evolution of the architecture models

Parallel architectures

- Single-core architecture improvement has stalled since 2003:
  - Power wall: increasing the processor frequency exponentially increases power consumption (a first-order model is sketched below)
  - Memory wall: the gap between memory and processor speeds keeps increasing
- The number of transistors on a chip keeps increasing:
  - Increase in the number of cores per chip
  - Multi-core architectures are omnipresent
- Trend: multi-core chips with lower frequencies and more cores
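To make the power wall concrete, a standard first-order CMOS power model (added here for context, not from the slides) relates dynamic power to voltage and frequency:

    % Dynamic power: alpha = activity factor, C = switched capacitance,
    % V = supply voltage, f = clock frequency
    P_{\mathrm{dyn}} = \alpha \, C \, V^{2} f
    % Sustaining a higher f requires a roughly proportional V, hence
    P_{\mathrm{dyn}} \propto f^{3}

Raising the frequency therefore pays a roughly cubic power cost, which is why the trend moved to more cores at lower frequencies.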

SLIDES 6-9

Evolution of the architecture models

Specialized parallel architectures

- Cell Broadband Engine (2005): 8 co-processors; used in the PlayStation 3 and in supercomputers
- Graphics Processing Units (GPU): massively parallel architectures, used to perform scientific computations
- System-on-chip (SoC), e.g. ARM, AMD Fusion: integrated CPU, GPU, DSP...
- Trend: heterogeneous architectures, composed of different architecture models

SLIDES 10-11

Heterogeneous architectures

- Multi-core (CPU) + several accelerators; the most general case:
  - any number of accelerators
  - any kind of accelerator
  - any kind of interconnection network
- (Examples of such machines are shown on the slide)
- Use the best-suited processing unit for each computation
- Manual tuning has to be repeated for each architecture
- Code portability is difficult to achieve

SLIDES 12-13

Abstract architecture model

[Figure: a host with CPU processing units, plus CUDA, MIC and OpenCL processing units, each group attached to its own memory.]

A network of memories, each with associated heterogeneous processing units.

SLIDES 14-24

Execution model

[Figure, animated across slides 14-24: the abstract architecture from slides 12-13 (host, CPU/CUDA/MIC/OpenCL processing units and their memories); data buffers A, B and C appear in the memories and are transferred and replicated between them as execution proceeds.]

Master-slave model: the host program drives the processing units.

SLIDES 25-27

Programming model

Low-level approach (e.g. OpenCL, CUDA...)

[Figure: a host with CPU processing units and three devices (two GPUs, one accelerator), grouped into two contexts.]

- Per-device command queues
- Command submission from the host
- The host program builds a command graph: transfer (Tr) and kernel (K) commands are submitted to each device's queue, and OpenCL callbacks notify the host (sketched below)
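As a minimal illustration of chaining commands into such a graph (a hedged sketch, not from the slides; `cq`, `kernel`, `buf`, `size` and `host_ptr` are assumed to be already created; clSetEventCallback requires OpenCL >= 1.1):

    #include <CL/cl.h>
    #include <stdio.h>

    void CL_CALLBACK on_done(cl_event ev, cl_int status, void *user_data) {
        /* Called from a runtime thread once the read reaches CL_COMPLETE */
        printf("buffer %s is ready\n", (const char *) user_data);
    }

    void submit(cl_command_queue cq, cl_kernel kernel, cl_mem buf,
                size_t size, void *host_ptr) {
        size_t gws = 256;
        cl_event k_done, r_done;
        clEnqueueNDRangeKernel(cq, kernel, 1, NULL, &gws, NULL, 0, NULL, &k_done);
        /* The read waits on the kernel: an edge in the command graph */
        clEnqueueReadBuffer(cq, buf, CL_FALSE, 0, size, host_ptr, 1, &k_done, &r_done);
        clSetEventCallback(r_done, CL_COMPLETE, on_done, (void *) "C");
        clFlush(cq);
    }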

SLIDE 28

OpenCL example (uncluttered)

C ← A + B

    float A[256], B[256], C[256];
    /* Select accelerator */
    clGetPlatformIDs(&platforms ...);
    clGetDeviceIDs(platforms[0], &devices ...);
    cl_context context = clCreateContext(devices ...);
    cl_command_queue cq = clCreateCommandQueue(context, devices[0] ...);
    /* Allocate buffers */
    cl_mem bufA = clCreateBuffer(context, 1024 ...);
    cl_mem bufB = clCreateBuffer(context, 1024 ...);
    cl_mem bufC = clCreateBuffer(context, 1024 ...);
    /* Send data */
    clEnqueueWriteBuffer(cq, bufA, 0, 1024, A, NULL, &event1 ...);
    clEnqueueWriteBuffer(cq, bufB, 0, 1024, B, NULL, &event2 ...);
    /* Execute kernel */
    clSetKernelArg(kernelAdd, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kernelAdd, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernelAdd, 2, sizeof(cl_mem), &bufC);
    cl_event deps[] = {event1, event2};
    clEnqueueNDRangeKernel(cq, kernelAdd, deps, &event3 ...);
    /* Receive data */
    clEnqueueReadBuffer(cq, bufC, 0, 1024, C, &event3, &event4);
    clWaitForEvents(event4);
    /* Release buffers */
    clReleaseMemObject(bufA); clReleaseMemObject(bufB); clReleaseMemObject(bufC);

SLIDE 29

OpenCL simple multi-device example (NVIDIA)

    const unsigned int MAX_GPU_COUNT = 8;
    const unsigned int DATA_N = 1048576 * 24;
    const unsigned int BLOCK_N = 128;
    const unsigned int THREAD_N = 128;
    const unsigned int ACCUM_N = BLOCK_N * THREAD_N;

    int main(int argc, const char **argv) {
        cl_context cxGPUContext;
        cl_device_id cdDevice;
        int deviceNr[MAX_GPU_COUNT];
        cl_command_queue commandQueue[MAX_GPU_COUNT];
        cl_mem d_Data[MAX_GPU_COUNT];
        cl_mem d_Result[MAX_GPU_COUNT];
        cl_program cpProgram;
        cl_kernel reduceKernel[MAX_GPU_COUNT];
        cl_event GPUDone[MAX_GPU_COUNT];
        cl_event GPUExecution[MAX_GPU_COUNT];
        cl_uint ciDeviceCount = 0;
        size_t programLength;
        cl_int ciErrNum;
        char cDeviceName[256];
        cl_mem h_DataBuffer;
        float h_SumGPU[MAX_GPU_COUNT * ACCUM_N];
        float *h_Data;
        double sumGPU;
        double sumCPU, dRelError;

        h_Data = (float *) malloc(DATA_N * sizeof(float));
        shrFillArray(h_Data, DATA_N);
        cxGPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, &ciErrNum);
        if (shrCheckCmdLineFlag(argc, argv, "device")) {
            // User specified GPUs
            char *deviceList;
            char *deviceStr;
            char *next_token;
            // Create command queues for all requested GPUs
            while (deviceStr != NULL) {
                // get & log device index # and name
                deviceNr[ciDeviceCount] = atoi(deviceStr);
                cdDevice = oclGetDev(cxGPUContext, deviceNr[ciDeviceCount]);
                ciErrNum = clGetDeviceInfo(cdDevice, CL_DEVICE_NAME, sizeof(cDeviceName), cDeviceName, NULL);
                shrCheckError(ciErrNum, CL_SUCCESS);
                // create a command queue
                commandQueue[ciDeviceCount] = clCreateCommandQueue(cxGPUContext, cdDevice, 0, &ciErrNum);
                shrCheckError(ciErrNum, CL_SUCCESS);
    #ifdef GPU_PROFILING
                ciErrNum = clSetCommandQueueProperty(commandQueue[ciDeviceCount], CL_QUEUE_PROFILING_ENABLE, CL_TRUE, NULL);
                shrCheckError(ciErrNum, CL_SUCCESS);
    #endif
                ++ciDeviceCount;
                deviceStr = strtok(NULL, " ,.-");
            }
            free(deviceList);
        } else {
            // Find out how many GPUs are available, to compute on all available GPUs
            size_t nDeviceBytes;
            ciErrNum = clGetContextInfo(cxGPUContext, CL_CONTEXT_DEVICES, 0, NULL, &nDeviceBytes);
            shrCheckError(ciErrNum, CL_SUCCESS);
            ciDeviceCount = (cl_uint) nDeviceBytes / sizeof(cl_device_id);
            for (unsigned int i = 0; i < ciDeviceCount; ++i) {
                // get & log device index # and name
                deviceNr[i] = i;
                cdDevice = oclGetDev(cxGPUContext, i);
                ciErrNum = clGetDeviceInfo(cdDevice, CL_DEVICE_NAME, sizeof(cDeviceName), cDeviceName, NULL);
                shrCheckError(ciErrNum, CL_SUCCESS);
                // create a command queue
                commandQueue[i] = clCreateCommandQueue(cxGPUContext, cdDevice, 0, &ciErrNum);
                shrCheckError(ciErrNum, CL_SUCCESS);
    #ifdef GPU_PROFILING
                ciErrNum = clSetCommandQueueProperty(commandQueue[i], CL_QUEUE_PROFILING_ENABLE, CL_TRUE, NULL);
                shrCheckError(ciErrNum, CL_SUCCESS);
    #endif
            }
        }
        // Load the OpenCL source code from the .cl file
        const char *source_path = shrFindFilePath("simpleMultiGPU.cl", argv[0]);
        char *source = oclLoadProgSource(source_path, "", &programLength);
        shrCheckError(source != NULL, shrTRUE);
        // Create the program for all GPUs in the context, then build it
        cpProgram = clCreateProgramWithSource(cxGPUContext, 1, (const char **) &source, &programLength, &ciErrNum);
        shrCheckError(ciErrNum, CL_SUCCESS);
        ciErrNum = clBuildProgram(cpProgram, 0, NULL, "-cl-mad-enable", NULL, NULL);
        if (ciErrNum != CL_SUCCESS) {
            // write out standard error, build log and PTX, then cleanup and exit
            oclLogBuildInfo(cpProgram, oclGetFirstDev(cxGPUContext));
            oclLogPtx(cpProgram, oclGetFirstDev(cxGPUContext), "oclSimpleMultiGPU.ptx");
            shrCheckError(ciErrNum, CL_SUCCESS);
        }
        // Create host buffer with page-locked memory
        h_DataBuffer = clCreateBuffer(cxGPUContext, CL_MEM_COPY_HOST_PTR | CL_MEM_ALLOC_HOST_PTR,
                                      DATA_N * sizeof(float), h_Data, &ciErrNum);
        shrCheckError(ciErrNum, CL_SUCCESS);
        // Create buffers for each GPU, with data divided evenly among GPUs
        int sizePerGPU = DATA_N / ciDeviceCount;
        int workOffset[MAX_GPU_COUNT];
        int workSize[MAX_GPU_COUNT];
        workOffset[0] = 0;
        for (unsigned int i = 0; i < ciDeviceCount; ++i) {
            workSize[i] = (i != (ciDeviceCount - 1)) ? sizePerGPU : (DATA_N - workOffset[i]);
            // Input buffer
            d_Data[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, workSize[i] * sizeof(float), NULL, &ciErrNum);
            shrCheckError(ciErrNum, CL_SUCCESS);
            // Copy data from host to device
            ciErrNum = clEnqueueCopyBuffer(commandQueue[i], h_DataBuffer, d_Data[i],
                                           workOffset[i] * sizeof(float), 0, workSize[i] * sizeof(float), 0, NULL, NULL);
            // Output buffer
            d_Result[i] = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, ACCUM_N * sizeof(float), NULL, &ciErrNum);
            shrCheckError(ciErrNum, CL_SUCCESS);
            // Create kernel
            reduceKernel[i] = clCreateKernel(cpProgram, "reduce", &ciErrNum);
            shrCheckError(ciErrNum, CL_SUCCESS);
            // Set the args values and check for errors
            ciErrNum |= clSetKernelArg(reduceKernel[i], 0, sizeof(cl_mem), &d_Result[i]);
            ciErrNum |= clSetKernelArg(reduceKernel[i], 1, sizeof(cl_mem), &d_Data[i]);
            ciErrNum |= clSetKernelArg(reduceKernel[i], 2, sizeof(int), &workSize[i]);
            shrCheckError(ciErrNum, CL_SUCCESS);
            workOffset[i + 1] = workOffset[i] + workSize[i];
        }
        // Set # of work items in work group and total in 1-dimensional range
        size_t localWorkSize[] = {THREAD_N};
        size_t globalWorkSize[] = {ACCUM_N};
        // Start timer and launch reduction kernel on each GPU, with data split between them
        for (unsigned int i = 0; i < ciDeviceCount; i++) {
            ciErrNum = clEnqueueNDRangeKernel(commandQueue[i], reduceKernel[i], 1, 0,
                                              globalWorkSize, localWorkSize, 0, NULL, &GPUExecution[i]);
            shrCheckError(ciErrNum, CL_SUCCESS);
        }
        // Copy result from device to host for each device
        for (unsigned int i = 0; i < ciDeviceCount; i++) {
            ciErrNum = clEnqueueReadBuffer(commandQueue[i], d_Result[i], CL_FALSE, 0,
                                           ACCUM_N * sizeof(float), h_SumGPU + i * ACCUM_N, 0, NULL, &GPUDone[i]);
            shrCheckError(ciErrNum, CL_SUCCESS);
        }
        // Synchronize with the GPUs and do accumulated error check
        clWaitForEvents(ciDeviceCount, GPUDone);
        // Aggregate results for multiple GPUs and stop/log processing time
        sumGPU = 0;
        for (unsigned int i = 0; i < ciDeviceCount * ACCUM_N; i++) sumGPU += h_SumGPU[i];
        // cleanup
        free(source);
        free(h_Data);
        for (unsigned int i = 0; i < ciDeviceCount; ++i) {
            clReleaseKernel(reduceKernel[i]);
            clReleaseCommandQueue(commandQueue[i]);
        }
        clReleaseProgram(cpProgram);
        clReleaseContext(cxGPUContext);
    }
SLIDE 30

Issue tackled in this thesis

How to write efficient and portable applications for heterogeneous architectures?

1. How to express parallelism?
   Task concept: the same operation with several implementations (one per architecture)
2. How to schedule tasks on the available units?
3. How to manage memories and data transfers?
4. How to adapt the granularity of tasks to the available units?
SLIDE 31

Low-level approaches

1. Dynamic construction of a graph of commands
2. Explicit task scheduling
3. Explicit memory management
4. Manual adaptation to the architecture

Static OpenCL kernel partitioning (Grewe et al., 2011)

Examples: OpenCL, CUDA...

SLIDE 32

Offloading approaches

Principle: use a simpler architecture model, best suited for a CPU + a single accelerator.

1. Identify code regions to offload to the accelerator
2. Schedule on the accelerator, or fall back to the CPU
3. Data transfers performed automatically
4. No need for granularity adaptation

Examples: OpenACC, OpenHMPP, OmpSs...

Similar to OpenMP; easier for migrating legacy C or Fortran codes (see the sketch below).
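For instance, an OpenACC-style offload of the vector addition from slide 28 could look as follows (a hedged sketch, not from the slides; directive details vary between compilers):

    /* Offloading with OpenACC: the compiler generates the accelerator
       kernel and the host<->device transfers from the directives. */
    void vec_add(const float *a, const float *b, float *c, int n) {
        /* copyin: transfer a and b to the device; copyout: bring c back */
        #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }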

SLIDE 33

Dynamic task graph approaches

1. Dynamic construction of a task graph
2. Automatic task scheduling
3. Automatic memory management
4. No granularity adaptation

Examples: StarPU, StarSs, XKaapi... (see the sketch below)
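A hedged sketch of the dynamic task-graph style, using the StarPU calls that appear later on slide 80 (field names vary across StarPU versions; treat this as pseudocode close to StarPU 1.x):

    #include <starpu.h>
    #include <stdint.h>

    /* One CPU implementation of the "add" task; CUDA/OpenCL variants
       could be listed in the same codelet so the scheduler can pick. */
    static void add_cpu(void *buffers[], void *cl_arg) {
        float *a = (float *) STARPU_VECTOR_GET_PTR(buffers[0]);
        float *b = (float *) STARPU_VECTOR_GET_PTR(buffers[1]);
        float *c = (float *) STARPU_VECTOR_GET_PTR(buffers[2]);
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        for (unsigned i = 0; i < n; i++) c[i] = a[i] + b[i];
    }

    static struct starpu_codelet add_cl = {
        .cpu_funcs = { add_cpu },
        .nbuffers = 3,
        .modes = { STARPU_R, STARPU_R, STARPU_W },
    };

    void add_vectors(float *A, float *B, float *C, unsigned n) {
        starpu_data_handle_t ha, hb, hc;
        starpu_vector_data_register(&ha, STARPU_MAIN_RAM, (uintptr_t) A, n, sizeof(float));
        starpu_vector_data_register(&hb, STARPU_MAIN_RAM, (uintptr_t) B, n, sizeof(float));
        starpu_vector_data_register(&hc, STARPU_MAIN_RAM, (uintptr_t) C, n, sizeof(float));
        /* The runtime infers dependencies from the R/W access modes,
           building the task graph dynamically as tasks are inserted. */
        starpu_insert_task(&add_cl, STARPU_R, ha, STARPU_R, hb, STARPU_W, hc, 0);
        starpu_task_wait_for_all();
    }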

SLIDE 34

Static task graph approaches

1. Static description of a task graph
2. Automatic task scheduling
3. Automatic memory management
4. Static transformations on the graph

Examples: DaGUE, StreamIt (synchronous data-flow)...

SLIDE 35

Limits of the current approaches

- Codes written using OpenCL cannot be easily adapted to use more advanced runtime systems
- Dynamic approaches lack an overview of the task graph: control is performed in host code
- Static approaches have limited expressiveness: no control flow (if, etc.) in the task graph

SLIDE 36

Outline

1. Context of the work
2. Extending OpenCL for a better portability
3. Heterogeneous parallel functional programming model
slide-37
SLIDE 37

18

Extending OpenCL for better portability

Objectives

Automatic kernel scheduling Automatic memory management and data transfers Automatic granularity adaptation

slide-38
SLIDE 38

18

Extending OpenCL for better portability

Objectives

Automatic kernel scheduling Automatic memory management and data transfers Automatic granularity adaptation

SOCL: our extended OpenCL implementation

Based on StarPU (StarPU OpenCL)

SLIDE 39

SOCL unified platform overview

[Figure: the application talks to the Installable Client Driver (libOpenCL), which exposes the vendor A ... vendor Z OpenCL platforms plus SOCL; SOCL sits on top of the vendor platforms and their devices (GPUs, MIC).]

SOCL handles synchronizations between different platforms (see the sketch below).
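Through the ICD, SOCL appears as one more OpenCL platform, so an application can select it by name (a hedged sketch using standard OpenCL calls; the exact platform name string is an assumption):

    #include <CL/cl.h>
    #include <string.h>

    /* Return the first platform whose name contains `wanted`, or NULL. */
    cl_platform_id find_platform(const char *wanted) {
        cl_platform_id ids[16];
        cl_uint n;
        clGetPlatformIDs(16, ids, &n);
        for (cl_uint i = 0; i < n; i++) {
            char name[256];
            clGetPlatformInfo(ids[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
            if (strstr(name, wanted)) return ids[i];
        }
        return NULL;
    }
    /* usage: cl_platform_id socl = find_platform("SOCL"); */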

SLIDES 40-45

SOCL: shared-object memory

[Figure, animated across slides 40-45: a host with CPU, GPU and accelerator processing units; buffers A, B and C are accessed in read (r) or write (w) mode and replicated across the memories as kernels use them.]

- Automatic transfers
- Coherence ensured
- Relies on StarPU data management

SLIDE 46

SOCL: shared-object memory example

    float A[256], B[256], C[256];
    /* Select accelerators */
    clGetPlatformIDs(&platforms ...);
    clGetDeviceIDs(platforms[0], &devices ...);
    cl_context context = clCreateContext(devices ...);
    cl_command_queue cq1 = clCreateCommandQueue(context, devices[0] ...);
    cl_command_queue cq2 = clCreateCommandQueue(context, devices[1] ...);
    /* Allocate buffers */
    cl_mem bufA = clCreateBuffer(context, 1024 ...);
    cl_mem bufB = clCreateBuffer(context, 1024 ...);
    cl_mem bufC = clCreateBuffer(context, 1024 ...);
    cl_mem bufC2 = clCreateBuffer(context, 1024 ...);
    /* Send data */
    clEnqueueWriteBuffer(cq1, bufA, 0, 1024, A, NULL, &event1 ...);
    clEnqueueWriteBuffer(cq1, bufB, 0, 1024, B, NULL, &event2 ...);
    /* Execute first kernel */
    clSetKernelArg(kernelAdd, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kernelAdd, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernelAdd, 2, sizeof(cl_mem), &bufC);
    cl_event deps[] = {event1, event2};
    clEnqueueNDRangeKernel(cq1, kernelAdd, deps, &event3 ...);
    /* Transfer data to GPU2 */
    clEnqueueReadBuffer(cq1, bufC, 0, 1024, C, &event3, &event4);
    clEnqueueWriteBuffer(cq2, bufC2, 0, 1024, C, &event3, &event5 ...);
    /* Execute second kernel */
    clSetKernelArg(kernelPotrf, 0, sizeof(cl_mem), &bufC2);
    clEnqueueNDRangeKernel(cq2, kernelPotrf, &event5, &event6 ...);
    clWaitForEvents(event6);
    /* Release buffers */
    clReleaseMemObject(bufA); clReleaseMemObject(bufB);
    clReleaseMemObject(bufC); clReleaseMemObject(bufC2);

SLIDE 47

SOCL: shared-object memory example (with shared objects, explicit transfers disappear)

    float A[256], B[256], C[256];
    /* Select accelerators */
    clGetPlatformIDs(&platforms ...);
    clGetDeviceIDs(platforms[0], &devices ...);
    cl_context context = clCreateContext(devices ...);
    cl_command_queue cq1 = clCreateCommandQueue(context, devices[0] ...);
    cl_command_queue cq2 = clCreateCommandQueue(context, devices[1] ...);
    /* Allocate buffers */
    cl_mem bufA = clCreateBuffer(context, 1024, CL_MEM_USE_HOST_PTR, A ...);
    cl_mem bufB = clCreateBuffer(context, 1024, CL_MEM_USE_HOST_PTR, B ...);
    cl_mem bufC = clCreateBuffer(context, 1024 ...);
    /* Execute first kernel */
    clSetKernelArg(kernelAdd, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kernelAdd, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernelAdd, 2, sizeof(cl_mem), &bufC);
    clEnqueueNDRangeKernel(cq1, kernelAdd, NULL, &event1 ...);
    /* Execute second kernel */
    clSetKernelArg(kernelPotrf, 0, sizeof(cl_mem), &bufC);
    clEnqueueNDRangeKernel(cq2, kernelPotrf, &event1, &event2 ...);
    clWaitForEvents(event2);
    /* Release buffers */
    clReleaseMemObject(bufA); clReleaseMemObject(bufB); clReleaseMemObject(bufC);

SLIDES 48-51

SOCL: context queues

[Figure, animated across slides 48-51: a host with three devices (two GPUs, one accelerator) grouped into two contexts; command queues move from being attached to devices to being attached to contexts.]

- Per-device command queues are replaced by per-context command queues
- Automatic scheduling of commands onto the devices of the context
- Relies on Andra Hugo's implementation of scheduling contexts in StarPU

SLIDE 52

SOCL: context queues example

    float A[256], B[256], C[256];
    /* Select accelerators */
    clGetPlatformIDs(&platforms ...);
    clGetDeviceIDs(platforms[0], &devices ...);
    cl_context context = clCreateContext(devices ...);
    cl_command_queue cq1 = clCreateCommandQueue(context, devices[0] ...);
    cl_command_queue cq2 = clCreateCommandQueue(context, devices[1] ...);
    /* Allocate buffers */
    cl_mem bufA = clCreateBuffer(context, 1024, CL_MEM_USE_HOST_PTR, A ...);
    cl_mem bufB = clCreateBuffer(context, 1024, CL_MEM_USE_HOST_PTR, B ...);
    cl_mem bufC = clCreateBuffer(context, 1024 ...);
    /* Execute first kernel */
    clSetKernelArg(kernelAdd, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kernelAdd, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernelAdd, 2, sizeof(cl_mem), &bufC);
    clEnqueueNDRangeKernel(cq1, kernelAdd, NULL, &event1 ...);
    /* Execute second kernel */
    clSetKernelArg(kernelPotrf, 0, sizeof(cl_mem), &bufC);
    clEnqueueNDRangeKernel(cq2, kernelPotrf, &event1, &event2 ...);
    clWaitForEvents(event2);
    /* Release buffers */
    clReleaseMemObject(bufA); clReleaseMemObject(bufB); clReleaseMemObject(bufC);

SLIDE 53

SOCL: context queues example (a single context queue; device selection is left to the scheduler)

    float A[256], B[256], C[256];
    /* Select accelerators */
    clGetPlatformIDs(&platforms ...);
    clGetDeviceIDs(platforms[0], &devices ...);
    cl_context context = clCreateContext(devices ...);
    cl_command_queue cq = clCreateCommandQueue(context, NULL ...);
    /* Allocate buffers */
    cl_mem bufA = clCreateBuffer(context, 1024, CL_MEM_USE_HOST_PTR, A ...);
    cl_mem bufB = clCreateBuffer(context, 1024, CL_MEM_USE_HOST_PTR, B ...);
    cl_mem bufC = clCreateBuffer(context, 1024 ...);
    /* Execute first kernel */
    clSetKernelArg(kernelAdd, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kernelAdd, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernelAdd, 2, sizeof(cl_mem), &bufC);
    clEnqueueNDRangeKernel(cq, kernelAdd, NULL, &event1 ...);
    /* Execute second kernel */
    clSetKernelArg(kernelPotrf, 0, sizeof(cl_mem), &bufC);
    clEnqueueNDRangeKernel(cq, kernelPotrf, &event1, &event2 ...);
    clWaitForEvents(event2);
    /* Release buffers */
    clReleaseMemObject(bufA); clReleaseMemObject(bufB); clReleaseMemObject(bufC);

SLIDE 54

SOCL: some benchmarks

LuxRender (rendering software)

[Bar chart: average samples/sec in millions (0 to 2.5) for CPU, 1 GPU, 2 GPUs and CPU + GPU configurations, using the Intel SDK, NVIDIA OpenCL and SOCL.]

Machine "Averell1": Intel Xeon E5-2650 2.00GHz with 64GB, 2 NVIDIA Tesla M2075.

SLIDE 55

SOCL: some benchmarks

Black-Scholes - blocks of 5M options

[Bar chart: GOptions/s (0 to 0.6) as a function of the number of blocks (10, 25, 50, 100, 150, 200, 250), for Intel, NVIDIA and SOCL.]

Machine "Hannibal": Intel Xeon X5550 2.67GHz with 24GB, 3 NVIDIA Quadro FX 5800.

SOCL provides automatic handling of large problem sizes.

SLIDE 56

SOCL: granularity adaptation mechanism

- Partitioning function (per kernel): users can associate a partitioning function with each kernel
- Partitioning factor: each partitioning function takes a partitioning factor as parameter, provided by the runtime system
- Strategy: sample executions with different factors (in a given range), then select the best one (a sketch of this strategy follows)
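A minimal sketch of the sampling strategy (generic C; run_partitioned is a hypothetical helper, and SOCL's actual interface is not shown here):

    #include <float.h>

    /* run_partitioned() runs the kernel split into `factor` pieces and
       returns the measured execution time. */
    typedef double (*partitioned_run_fn)(void *kernel, unsigned factor);

    unsigned select_factor(void *kernel, partitioned_run_fn run_partitioned,
                           unsigned lo, unsigned hi) {
        unsigned best = lo;
        double best_time = DBL_MAX;
        for (unsigned f = lo; f <= hi; f *= 2) {  /* sample the given range */
            double t = run_partitioned(kernel, f);
            if (t < best_time) { best_time = t; best = f; }
        }
        return best;  /* reused by the runtime for subsequent executions */
    }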

SLIDE 57

SOCL: some benchmarks

NBody (OTOO) - 20 iterations - 4000k particles

[Bar chart: speed-up compared to a single kernel (0 to 2.5) for 2, 4, 8, 16, 32 and 64 blocks, and for the adaptive SOCL strategy (heft).]

Machine "Hannibal": Intel Xeon X5550 2.67GHz with 24GB, 3 NVIDIA Quadro FX 5800.

SLIDE 58

SOCL: implementation

- Full implementation of the OpenCL 1.0 specification
- Some additional 1.1 and 1.2 APIs
- Installable Client Driver (ICD) extension supported
- Integrated into StarPU's repository: http://runtime.bordeaux.inria.fr/StarPU/

SLIDE 59

SOCL: conclusion

1. OpenCL interface
2. Automatic task scheduling: command queues associated with contexts
3. Automatic memory management
4. Granularity adaptation mechanism: partitioning functions

Performance on par with the state of the art.

Publications:
1. Programmation multi-accélérateurs unifiée en OpenCL - RenPAR'20 (2011)
2. Programmation multi-accélérateurs unifiée en OpenCL (extended) - TSI 31 (2012)
3. SOCL: An OpenCL Implementation with Automatic Multi-Device Adaptation Support - Inria Research Report (2013)

SLIDE 60

Outline

1. Context of the work
2. Extending OpenCL for a better portability
3. Heterogeneous parallel functional programming model
slide-61
SLIDE 61

33

Heterogeneous parallel functional model

Objective

Use a more declarative language to describe task graphs Integrate control (if, loops, data-dependence. . . ) Allow static and dynamic transformations Better granularity adaptation support

slide-62
SLIDE 62

33

Heterogeneous parallel functional model

Objective

Use a more declarative language to describe task graphs Integrate control (if, loops, data-dependence. . . ) Allow static and dynamic transformations Better granularity adaptation support

Use implicit parallel functional programming

Kernels ≃ pure functions Functional programs are graphs of pure functions

SLIDES 63-64

Functional programming

Application: f a b
[Figure: the expression drawn as a graph of application nodes (@).]

Constant applicative forms:
    c = f a b
    d = c + c
[Figure: the shared graph for c and d; d references the node for c twice.]

We can associate kernels to some symbols (e.g. "+", "f"): the program becomes a data-flow graph.

SLIDES 65-67

Parallel evaluation

    t1 = f1 a
    result = f4 (f3 t1 c) (f2 t1 b)

[Figure: the expression graph on the left is evaluated into a task graph on the right; tasks t1, t2, t3 and result read (r) and write (w) objects a, b and c in the shared-object memory, and the kernels f1-f4 run in parallel where dependencies allow.]

- Intermediate data automatically allocated
- Garbage collection of unused data

SLIDES 68-71

Control

Abstractions (functions):
    f x y = (x * x) + (y * y)
[Figure: the corresponding graph, with lambda nodes binding x and y.]

Conditionals:
    if x == 0 then f y else g z
[Figure: the corresponding graph, with an "if" node selecting between the f and g branches.]

Recursive functions ≃ loops:
    while test f x = if test x then (while test f (f x)) else x

- Speculative prefetching
- Speculative execution

SLIDE 72

Data-partitioning

split w h m
- Splits matrix m into w × h tiles
- The result is a matrix of matrices

unsplit w h m
- Recomposes matrix m
- m must be a w × h matrix of matrices
- Costly operation: it transfers all matrix parts into the same memory (see the sketch below)
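To see why unsplit is costly, here is a hedged C sketch of what recomposition implies for a row-major matrix stored as w × h tiles (a generic illustration, not ViperVM's implementation):

    #include <string.h>

    /* Gather w*h tiles (each tw x th, row-major) back into one big
       row-major matrix: every element is copied into a single memory. */
    void unsplit(float *dst, float **tiles, int w, int h, int tw, int th) {
        int width = w * tw;                    /* width of the big matrix */
        for (int ty = 0; ty < h; ty++)         /* tile row                */
            for (int tx = 0; tx < w; tx++)     /* tile column             */
                for (int row = 0; row < th; row++)
                    /* copy one tile row into its place in dst */
                    memcpy(dst + (ty * th + row) * width + tx * tw,
                           tiles[ty * w + tx] + row * tw,
                           tw * sizeof(float));
    }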

SLIDE 73

Data-partitioning example

Tiled matrix addition:

    addTiled a b = unsplit w h (zipWith2D (+) (split w h a) (split w h b))

[Figure: a and b are split into 2 × 3 tiles (a00...a12, b00...b12), added tile-wise with zipWith2D (+), and recomposed with unsplit 2 3.]

SLIDES 74-76

Granularity adaptation

A symbol such as "+" may be bound to several kernels (OpenCL kernel, CPU kernel, CUDA kernel) and to alternative expressions such as "addTiled":

    addTiled a b = unsplit w h (zipWith2D (+) (split w h a) (split w h b))

- Cost models are used to select between kernels (cf. StarPU, etc.)
- Can we extend them to select between kernels and alternative expression(s)?
- Implemented strategy: selection based on input data size

SLIDE 77

Transformations

Rewrite rules: detect and modify patterns in the program/graph.

Example: remove unnecessary data partitions (a sketch of such a rewrite follows)

    forall w h . split w h (unsplit w h x) = x

    r = a + b + c
    r = (unsplit w h (zipWith2D (+) (split w h a) (split w h b))) + c
    r = unsplit w h (zipWith2D (+)
          (split w h (unsplit w h (zipWith2D (+) (split w h a) (split w h b))))
          (split w h c))
    r = unsplit w h (zipWith2D (+)
          (zipWith2D (+) (split w h a) (split w h b))
          (split w h c))
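A hedged sketch of this rewrite on a tiny expression tree (generic C for illustration; ViperVM performs this on its functional graph, not with these types):

    /* Rewrite rule: Split(w,h, Unsplit(w,h, x)) => x.
       Minimal AST: each node has an operator, partition sizes and a child. */
    typedef enum { OP_SPLIT, OP_UNSPLIT, OP_OTHER } Op;

    typedef struct Node {
        Op op;
        int w, h;            /* only meaningful for split/unsplit */
        struct Node *child;  /* first child (enough for this rule) */
    } Node;

    Node *rewrite(Node *n) {
        if (!n) return 0;
        n->child = rewrite(n->child);       /* rewrite bottom-up */
        if (n->op == OP_SPLIT && n->child &&
            n->child->op == OP_UNSPLIT &&
            n->w == n->child->w && n->h == n->child->h)
            return n->child->child;         /* drop both nodes */
        return n;
    }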

SLIDE 78

ViperVM: runtime system overview

[Figure: the runtime system takes a program, input data (a, b) and kernels, and produces output data (r).]

Program:

    square x = x * x
    r = let a' = square a
            b' = square b
        in (a' - b') * (a' + b')

Kernels:

    "*" = {matrixMulOpenCL, matrixMulCUDA, matrixMulCPU, ...}
    "-" = {...}
    "+" = {...}

SLIDE 79

Configuration

3 kinds of code:

Configuration - host code (mostly imperative):

    pf <- initPlatform $ Configuration {
            libraryOpenCL = "libOpenCL.so" }
    rt <- initRuntime pf eagerScheduler
    a <- initFloatMatrix rt [[1.0, 2.0, 3.0],
                             [4.0, 5.0, 6.0],
                             [7.0, 8.0, 9.0]]
    b <- initFloatMatrix rt [[1.0, 4.0, 7.0],
                             [2.0, 5.0, 8.0],
                             [3.0, 6.0, 9.0]]
    builtins <- loadBuiltins rt
        [ ("+", floatMatrixAddBuiltin)
        , ("-", floatMatrixSubBuiltin)
        , ("*", floatMatrixMulBuiltin)
        , ("a", dataBuiltin a)
        , ("b", dataBuiltin b) ]
    prog <- readFile "example.vvm"
    r <- eval builtins prog
    printFloatMatrix rt r

Coordination - parallel functional code:

    -- File: example.vvm
    square x = x * x
    main = let a' = square a
               b' = square b
           in (a' - b') * (a' + b')

Computation - kernels (C, Fortran, CUDA, OpenCL...):

    __kernel void floatMatrixAdd(uint width, uint height,
                                 __global float *A, __global float *B,
                                 __global float *C) {
        int gx = get_global_id(0);
        int gy = get_global_id(1);
        if (gx < width && gy < height)
            C[gy*width+gx] = A[gy*width+gx] + B[gy*width+gx];
    }

SLIDE 80

ViperVM: expressivity

Tiled matrix addition example:

    /* StarPU */
    struct starpu_data_filter f = {
        .filter_func = starpu_matrix_filter_vertical_block,
        .nchildren = w
    };
    struct starpu_data_filter f2 = {
        .filter_func = starpu_matrix_filter_block,
        .nchildren = h
    };
    starpu_data_map_filters(a, 2, &f, &f2);
    starpu_data_map_filters(b, 2, &f, &f2);
    starpu_data_map_filters(c, 2, &f, &f2);
    for (i = 0; i < nw; i++) {
        for (j = 0; j < nh; j++) {
            starpu_data_handle_t sa = starpu_data_get_sub_data(a, 2, i, j);
            starpu_data_handle_t sb = starpu_data_get_sub_data(b, 2, i, j);
            starpu_data_handle_t sc = starpu_data_get_sub_data(c, 2, i, j);
            starpu_insert_task(&add, STARPU_R, sa, STARPU_R, sb, STARPU_W, sc, 0);
        }
    }
    starpu_task_wait_for_all();
    starpu_data_unpartition(c, 0);

    -- ViperVM: explicit
    c = unsplit (zipWith2D (+) (split w h a) (split w h b))

    -- ViperVM: with automatic granularity adaptation
    c = a + b

SLIDE 81

ViperVM: some (preliminary) benchmarks

Matrix addition (tile size = 8K):

    Dimensions   ViperVM (3 GPUs + CPU)   ViperVM (3 GPUs)   StarPU (3 GPUs)
    16K x 16K    1.9s                     2.1s               1.4s
    24K x 24K    4.0s                     4.4s               2.9s

Matrix multiplication (4096 x 4096), by tile size (w x h):

    w x h            1024x1024   4096x1024   1024x4096
    GPU (1x)         4.5s        4.4s        4.3s
    GPU (2x)         3.6s        2.9s        3.2s
    GPU (3x)         3.1s        2.5s        3.3s
    CPU              31s         36s         35s
    GPU (3x) + CPU   3.3s        3.7s        10s

- Performance comparable with StarPU
- Scales with the number of devices
- Scheduling policy not yet on par with StarPU's

SLIDE 82

ViperVM: implementation

- Alpha version 0.2: https://github.com/hsyl20/HViperVM/tree/0.2
- Runtime system implemented in Haskell
- Lisp-like frontend (parser)
- Parallel reducer (using Software Transactional Memory)
- Support for OpenCL kernels
- Eager scheduling strategy
- Naive substitution mechanism (based on input sizes)

Future work:
- Garbage collector
- Other backends (CUDA, Xeon Phi...)
- Better scheduling strategies (HEFT...)
- Enhanced frontend (type checking, etc.)

SLIDE 83

Heterogeneous parallel functional model

Conclusion: parallel functional programming + kernels
- A language well adapted to describing task graphs
- Control integrated in the graph
- Native kernel performance
- Static and dynamic graph transformations
- Granularity adaptation mechanism

Publications:
1. ViperVM: a Runtime System for Parallel Functional High-Performance Computing on Heterogeneous Architectures - FHPC workshop (2013)

SLIDE 84

General conclusion

Problem tackled: writing efficient and portable codes for heterogeneous architectures.

Contributions:
- Better portability for OpenCL applications with SOCL: automatic memory management and kernel scheduling
- A high-level approach using functional programming: better expressivity, graph transformations
- Granularity adaptation mechanisms in both cases

SLIDE 85

Perspectives

Improve granularity adaptation:
- Cost models for functional expressions
- Inference of the partitioning factors
- Choice between several alternative expressions

Revisit common HPC issues in the heterogeneous parallel functional model:
- Check-pointing
- Fault tolerance

Kernel generation and transformation:
- Data-parallel kernel description
- Automatic derivation of alternative algorithms (cf. the Bird-Meertens formalism)