
SLIDE 1

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Roshan Dathathri Chandan Reddy Thejas Ramashekar Uday Bondhugula

Department of Computer Science and Automation Indian Institute of Science {roshan,chandan.g,thejas,uday}@csa.iisc.ernet.in

September 11, 2013

Multicore Computing Lab (CSA, IISc) Automatic Data Movement September 11, 2013 1 / 61

SLIDE 2

Parallelizing code for distributed-memory architectures

OpenMP code for shared-memory systems:

```c
for (i = 1; i <= N; i++) {
    #pragma omp parallel for
    for (j = 1; j <= N; j++) {
        <computation>
    }
}
```

MPI code for distributed-memory systems:

```c
for (i = 1; i <= N; i++) {
    set_of_j_s = dist(1, N, processor_id);
    for each j in set_of_j_s {
        <computation>
    }
    <communication>
}
```

Explicit communication is required between: devices in a heterogeneous system with CPUs and multiple GPUs. nodes in a distributed-memory cluster. Hence, tedious to program.

SLIDE 3

Affine loop nests

Arbitrarily nested loops with affine bounds and affine accesses. Form the compute-intensive core of scientific computations like:

stencil-style computations, linear algebra kernels, and alternating direction implicit (ADI) integrations.

Can be analyzed by the polyhedral model.

SLIDE 4

Example iteration space

Figure: iteration space over (i, j) with dependences (1,0), (0,1) and (1,1).

SLIDE 5

Example iteration space

Figure: iteration space over (i, j) with dependences (1,0), (0,1), (1,1) and tiles.

SLIDE 6

Example iteration space

Figure: iteration space over (i, j) with dependences (1,0), (0,1), (1,1), tiles and parallel phases.

SLIDE 7

Automatic Data Movement

For affine loop nests: statically determine the data to be transferred between compute devices, with the goal of moving only those values that need to be moved to preserve program semantics.

Generate data movement code that is:

parametric in problem size symbols and number of compute devices. valid for any computation placement.

SLIDE 8

Communication is parameterized on a tile

A tile represents an iteration of the innermost distributed loop. It may or may not be the result of loop tiling. A tile is executed atomically by a compute device.

SLIDE 9

Existing flow-out (FO) scheme

Figure: tiled iteration space with dependences (1,0), (0,1), (1,1) and the flow-out set of a tile.

Flow-out set: The values that need to be communicated to other tiles. Union of per-dependence flow-out sets of all RAW dependences.

SLIDE 10

Existing flow-out (FO) scheme

Figure: tiled iteration space with dependences (1,0), (0,1), (1,1), the flow-out set and its receiving tiles.

Receiving tiles: The set of tiles that require the flow-out set.

SLIDE 11

Existing flow-out (FO) scheme

Figure: tiled iteration space with dependences (1,0), (0,1), (1,1), the flow-out set and the resulting communication.

Not all elements in the flow-out set are required by every receiving tile. The scheme only ensures that each receiver requires at least one element of the communicated set, and so could transfer unnecessary elements.

SLIDE 12

Our first scheme

Motivation: All elements in the data communicated should be required by the receiver. Key idea: Determine data that needs to be sent from one tile to another, parameterized on a sending tile and a receiving tile.

SLIDE 13

Flow-in (FI) set

Figure: tiled iteration space with dependences (1,0), (0,1), (1,1) and the flow-in set of a tile.

Flow-in set: The values that need to be received from other tiles. Union of per-dependence flow-in sets of all RAW dependences.

SLIDE 14

Flow-out intersection flow-in (FOIFI) scheme

Figure: tiled iteration space with dependences (1,0), (0,1), (1,1) and a flow set.

Flow set: Parameterized on two tiles. The values that need to be communicated from a sending tile to a receiving tile. Intersection of the flow-out set of the sending tile and the flow-in set of the receiving tile.

SLIDE 17

Flow-out intersection flow-in (FOIFI) scheme

Figure: tiled iteration space with dependences (1,0), (0,1), (1,1), flow sets and the resulting communication.

Precise communication when each receiving tile is executed by a different compute device. Could lead to huge duplication when multiple receiving tiles are executed by the same compute device.

SLIDE 18

Comparison with virtual processor based schemes

Some existing schemes use a virtual-processor-to-physical-processor mapping to handle symbolic problem sizes and numbers of compute devices. Tiles can be considered virtual processors in FOIFI. FOIFI incurs less redundant communication than prior works that use virtual processors since it:

uses exact dataflow information, and combines data to be moved due to multiple dependences.

SLIDE 19

Our main scheme

Motivation: partition the communication set such that all elements within each partition are required by all receivers of that partition. Key idea: partition the dependences in a particular way, and determine communication sets and their receivers based on those partitions.

SLIDE 20

Flow-out partitioning (FOP) scheme

Figure: tiled iteration space with dependences (1,0), (0,1), (1,1) and flow-out partitions.

Source-distinct partitioning of dependences partitions dependences such that: all dependences in a partition communicate the same set of values; any two dependences in different partitions communicate disjoint sets of values. Determine the communication set and receiving tiles for each partition.

SLIDE 21

Flow-out partitioning (FOP) scheme

Figure: tiled iteration space with dependences (1,0), (0,1), (1,1) and flow-out partitions.

Communication sets of different partitions are disjoint. The union of the communication sets of all partitions yields the flow-out set. Hence, the flow-out set of a tile is partitioned.

SLIDE 22

Source-distinct partitioning of dependences

Figure: tiled iteration space showing dependence (1,1) and partitions of dependences.

Initially, each dependence is: restricted to those constraints which are inter-tile, and put in a separate partition.

SLIDE 23

Source-distinct partitioning of dependences

Figure: tiled iteration space showing dependence (1,0) and partitions of dependences.

Initially, each dependence is: restricted to those constraints which are inter-tile, and put in a separate partition.

SLIDE 24

Source-distinct partitioning of dependences

Figure: tiled iteration space showing dependence (0,1) and partitions of dependences.

Initially, each dependence is: restricted to those constraints which are inter-tile, and put in a separate partition.

SLIDE 25

Source-distinct partitioning of dependences

Figure: tiled iteration space showing dependences (1,0) and (1,1) and partitions of dependences.

For all pairs of dependences in two partitions: find the source iterations that access the same region of data (source-identical); get new dependences by restricting the original dependences to the source-identical iterations; subtract the new dependences from the original dependences. The set of new dependences formed is a new partition.

SLIDE 27

Source-distinct partitioning of dependences

Figure: tiled iteration space showing dependences (0,1) and (1,1) and partitions of dependences.

For all pairs of dependences in two partitions: find the source iterations that access the same region of data (source-identical); get new dependences by restricting the original dependences to the source-identical iterations; subtract the new dependences from the original dependences. The set of new dependences formed is a new partition.

SLIDE 29

Source-distinct partitioning of dependences

Figure: tiled iteration space showing dependences (1,0), (0,1) and (1,1) and partitions of dependences.

For all pairs of dependences in two partitions: find the source iterations that access the same region of data (source-identical); get new dependences by restricting the original dependences to the source-identical iterations; subtract the new dependences from the original dependences. The set of new dependences formed is a new partition.

SLIDE 31

Source-distinct partitioning of dependences

Figure: tiled iteration space showing dependences (1,0), (0,1) and (1,1) and partitions of dependences.

Stop when no new partitions can be formed.

SLIDE 32

Flow-out partitioning (FOP) scheme: at runtime

Figure: tiled iteration space with dependences (1,0), (0,1), (1,1) and flow-out partitions.

For each partition and tile executed, one of these is chosen:
multicast-pack: the partitioned communication set from this tile is copied to the buffers of all its receivers.
unicast-pack: the partitioned communication set from this tile to a receiving tile is copied to the buffer of that receiver.
unicast-pack is chosen only if each receiving tile is executed by a different receiver.

SLIDE 33

Flow-out partitioning (FOP) scheme

Figure: tiled iteration space with dependences (1,0), (0,1), (1,1), flow-out partitions and the resulting communication.

Reduces the granularity at which receivers are determined. Reduces the granularity at which the conditions to choose between multicast-pack and unicast-pack are applied. Minimizes communication of both duplicate and unnecessary elements.

SLIDE 34

Another example - dependences

Figure: 3-D iteration space over (k, i, j) with the plane j = k + 1, showing Dependence1 and Dependence2.

Let (k, i, j) be the source iteration and (k′, i′, j′) the target iteration.
Dependence1: k′ = k + 1, i′ = i, j′ = j.
Dependence2: k′ = k + 1, i′ = i, j = k + 1.

SLIDE 35

Another example - FO scheme

Figure: tiled 3-D iteration space (plane j = k + 1) with Dependence1, Dependence2 and the flow-out set.

SLIDE 36

Another example - FOIFI scheme

Figure: tiled 3-D iteration space (plane j = k + 1) with Dependence1, Dependence2 and a flow set.

SLIDE 37

Another example - FOIFI scheme

Figure: tiled 3-D iteration space (plane j = k + 1) with Dependence2 and a flow set.

SLIDE 39

Another example - FOP scheme

Figure: tiled 3-D iteration space (plane j = k + 1) with Dependence1, Dependence2 and a flow-out partition.

SLIDE 40

Another example - FOP scheme

Figure: tiled 3-D iteration space (plane j = k + 1) with Dependence1 and a flow-out partition.

SLIDE 41

Implementation

As part of the PLUTO framework. Input is sequential C code which is tiled and parallelized using the PLUTO algorithm. Data movement code is automatically generated using our scheme.

SLIDE 42

Implementation - distributed-memory systems

Code for distributed-memory systems is automatically generated using existing techniques. Asynchronous MPI primitives are used to communicate between nodes in a distributed-memory system.

SLIDE 43

Implementation - heterogeneous systems

For heterogeneous systems, the host CPU acts both as a compute device and as the orchestrator of data movement between compute devices, while the GPU acts only as a compute device. OpenCL functions clEnqueueReadBufferRect() and clEnqueueWriteBufferRect() are used for data movement in heterogeneous systems.

SLIDE 44

Experimental evaluation: distributed-memory cluster

32-node InfiniBand cluster. Each node consists of two quad-core Intel Xeon E5430 2.66 GHz processors. The cluster uses MVAPICH2-1.8 as the MPI implementation.

SLIDE 45

Benchmarks

Floyd-Warshall (floyd). LU decomposition (lu). Alternating Direction Implicit solver (adi). 2-D Finite-Difference Time-Domain kernel (fdtd-2d). Heat 2D equation (heat-2d). Heat 3D equation (heat-3d). The first four are from the PolyBench/C 3.2 suite, while heat-2d and heat-3d are widely used stencil computations.

SLIDE 46

Comparison of FOP, FOIFI and FO

The same parallelizing transformation implies the same frequency of communication; the schemes differ only in communication volume. Comparing execution times therefore directly compares their efficiency.

SLIDE 47

Comparison of FOP with FO

Communication volume reduced by a factor of 1.4× to 63.5×. Communication volume reduction translates to significant speedup, except for heat-2d. Speedup of up to 15.9×. Mean speedup of 1.55×.

SLIDE 48

Comparison of FOP with FOIFI

Similar behavior for stencil-style codes. For floyd and lu:

Communication volume reduced by a factor of 1.5× to 31.8×. Speedup of up to 1.84×.

Mean speedup of 1.11×.

SLIDE 49

OMPD - OpenMP to MPI

Takes OpenMP code as input and generates MPI code. Primarily a runtime dataflow analysis technique. Handles only those affine loop nests which have a repetitive communication pattern.

Communication should not vary based on the outer sequential loop.

Cannot handle floyd, lu, and time-tiled (outer sequential dimension tiled) stencil-style codes.

SLIDE 50

Comparison of FOP with OMPD

For heat-2d and heat-3d, significant speedup over OMPD.

The computation time is much lower. Better load balance and locality due to advanced transformations. OMPD cannot handle such transformed code.

For adi: significant speedup over OMPD.

Same volume of communication. Better performance due to loop tiling. Lower runtime overhead.

Mean speedup of 3.06×.

SLIDE 51

Unified Parallel C (UPC)

Unified programming model for both shared-memory and distributed-memory systems. All benchmarks were manually ported to UPC.

Sharing data only if it may be accessed remotely. UPC-specific optimizations like localized array accesses, block copy, and one-sided communication.

SLIDE 52

Comparison of FOP with UPC

For lu, heat-2d and heat-3d, significant speedup over UPC.

Better load balance and locality due to advanced transformations. Difficult to manually write such transformed code. UPC model is not suitable when the same data element could be written by different nodes in different parallel phases.

SLIDE 55

Comparison of FOP with UPC

For lu, heat-2d and heat-3d, significant speedup over UPC.

Better load balance and locality due to advanced transformations. Difficult to manually write such transformed code. UPC model is not suitable when the same data element could be written by different nodes in different parallel phases.

For adi: significant speedup over UPC.

Same computation time and communication volume. Data to be communicated is not contiguous in memory. UPC incurs huge runtime overhead for such multiple shared memory requests to non-contiguous data.

For fdtd-2d and floyd: UPC performs slightly better.

Same computation time and communication volume. Data to be communicated is contiguous in memory. UPC has no additional runtime overhead.

Mean speedup of 2.19×.

SLIDE 56

Results: distributed-memory cluster

Figure: FOP – strong scaling on the distributed-memory cluster (speedup vs. number of nodes, 1–32, for floyd, fdtd-2d, heat-3d, lu, heat-2d and adi).

Figure: floyd – speedup of FOP, FOIFI, FO and hand-optimized UPC code over seq on the distributed-memory cluster.

For the transformations and computation placement chosen: FOP achieves the minimum communication volume.

SLIDE 57

Experimental evaluation: heterogeneous systems

Intel-NVIDIA system: Intel Xeon multicore server consisting of 12 Xeon E5645 cores. 4 NVIDIA Tesla C2050 graphics processors connected on the PCI express bus. NVIDIA driver version 304.64 supporting OpenCL 1.1.

SLIDE 58

Comparison of FOP with FO

Communication volume reduced by a factor of 11× to 83×. Communication volume reduction translates to significant speedup. Speedup of up to 3.47×. Mean speedup of 1.53×.

SLIDE 59

Results: heterogeneous systems

Figure: FOP – strong scaling on the Intel-NVIDIA system (speedup vs. device combination: 1 GPU, 1 CPU + 1 GPU, 2 GPUs, 4 GPUs, for floyd, fdtd-2d, heat-3d, lu and heat-2d).

For the transformations and computation placement chosen: FOP achieves the minimum communication volume.

SLIDE 60

Acknowledgements

SLIDE 61

Conclusions

The framework we propose frees programmers from the burden of moving data. Partitioning of dependences enables precise determination of the data to be moved. Our tool is the first to parallelize affine loop nests for a combination of CPUs and GPUs while providing precision of data movement at the granularity of array elements. If implemented in compilers, our techniques can provide OpenMP-like programmer productivity for distributed-memory and heterogeneous architectures. Publicly available: http://pluto-compiler.sourceforge.net/

SLIDE 62

Results: distributed-memory cluster

Mean speedup of FOP over FO is 1.55x Mean speedup of FOP over OMPD is 3.06x Mean speedup of FOP over UPC is 2.19x

SLIDE 63

Results: heterogeneous systems

Mean speedup of FOP over FO is 1.53x
