
Automatic Data Allocation, Buffer Management and Data movement for Multi-GPU Machines

Thejas Ramashekar, MSc (Engg) Thesis Defence. Advisor: Dr. Uday Bondhugula, Indian Institute of Science

A Typical HPC Setup

[Figure: several compute nodes connected over a network; each node contains a CPU with DDR RAM attached through a north bridge, and multiple GPUs (GPU1, GPU2, ..., GPU N).]

Multi-GPU Machine

[Figure: a single node with a CPU, a north bridge, DDR RAM and multiple GPUs (GPU1, GPU2, ..., GPU N).]

Multi-GPU Setup - Key properties

  • Distributed memory architecture
  • Limited GPU memory (512 MB to 6 GB)
  • Limited PCIe bandwidth (max 8 GB/s)

Affine loop nests

  • Loop nests which have affine bounds, and in which the array access functions in the computation statements are affine functions of outer loop iterators and program parameters
  • e.g. stencils, linear-algebra kernels, dynamic programming codes, data mining applications
  • e.g. Floyd-Warshall, whose loop nest has affine bounds and affine access functions (a minimal C sketch follows below)
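For concreteness, a plain-C Floyd-Warshall kernel is sketched below. It is an illustrative sketch only (the array name path and the size N are assumptions, not the thesis benchmark code); the loop bounds depend only on the program parameter N, and every array access is an affine function of the iterators i, j and the outer serial iterator k.

```c
/* Illustrative Floyd-Warshall kernel: an affine loop nest.
 * Loop bounds depend only on the program parameter N (affine bounds);
 * accesses path[i][j], path[i][k], path[k][j] are affine functions of
 * the iterators i, j and the outer serial iterator k. */
#define N 1024
double path[N][N];

void floyd_warshall(void)
{
    for (int k = 0; k < N; k++)          /* serial dimension   */
        for (int i = 0; i < N; i++)      /* parallel dimension */
            for (int j = 0; j < N; j++)
                if (path[i][k] + path[k][j] < path[i][j])
                    path[i][j] = path[i][k] + path[k][j];
}
```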

Running an affine loop nest on a multi-GPU machine

Starting from a serial C program containing one or more affine loop nests (with serial and parallel dimensions):
  • Extract parallelism and tile
  • Distribute tiles among the GPUs
  • Allocate data for each tile
  • Perform computations
  • Perform inter-GPU coherency
  • Proceed to the next serial iteration

Structure of an affine loop nest for a multi-GPU machine

The need for a multi-GPU memory manager

  • Manual programming of multi-GPU systems is tedious, error-prone and time consuming
  • Existing works either:
    ○ Are manual, application-specific techniques, or
    ○ Have inefficiencies in terms of data allocation sizes, reuse exploitation, inter-GPU coherency, etc.

Design goals for a multi-GPU memory manager

  • The desired abilities for a multi-GPU memory manager are:
    ○ To identify and minimize data allocation sizes
    ○ To reuse data already present on the GPU
    ○ To keep data transfers minimal and efficient
    ○ To achieve all the above with minimal overhead

Bounding Boxes

  • The bounding box of an access function is the smallest hyper-rectangle that encapsulates all the array elements accessed by it (a minimal representation is sketched below)
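A bounding box can be represented simply by per-dimension lower and upper bounds. The sketch below is a minimal illustration of such a representation; bbox_t, MAX_DIMS and bbox_size() are assumed names for illustration, not BBMM's actual data structures.

```c
/* Minimal sketch of a bounding box: the smallest hyper-rectangle
 * [lb[d], ub[d]] in every array dimension d that covers all elements
 * touched by an access function. Names are illustrative only. */
#define MAX_DIMS 3

typedef struct {
    int ndims;            /* dimensionality of the array         */
    long lb[MAX_DIMS];    /* inclusive lower bound per dimension */
    long ub[MAX_DIMS];    /* inclusive upper bound per dimension */
} bbox_t;

/* Number of array elements covered by a bounding box. */
static long bbox_size(const bbox_t *b)
{
    long n = 1;
    for (int d = 0; d < b->ndims; d++)
        n *= b->ub[d] - b->lb[d] + 1;
    return n;
}
```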

Key insights on bounding boxes

  • Two key insights:
    ○ Bounding boxes can be subjected to standard set operations at runtime with negligible overhead
    ○ GPUs have architectural support for fast rectangular copies

Set Operations on Bounding Boxes

[Figures: worked examples of set operations on bounding boxes.]

  • Negligible runtime overhead (see the sketch below)
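To illustrate why these set operations are cheap, the sketch below computes the intersection of two bounding boxes and the bounding box of their union with one min/max comparison per dimension. It is a self-contained illustration using the same assumed bbox_t layout as above (repeated here so the sketch stands alone), not BBMM's implementation.

```c
/* Sketch: cheap set operations on bounding boxes (hyper-rectangles).
 * Each operation needs only a min/max per dimension, which is why the
 * runtime overhead is negligible. Illustrative code, not BBMM's internals. */
#define MAX_DIMS 3

typedef struct { int ndims; long lb[MAX_DIMS], ub[MAX_DIMS]; } bbox_t;

static long lmin(long a, long b) { return a < b ? a : b; }
static long lmax(long a, long b) { return a > b ? a : b; }

/* Intersection of a and b; returns 0 if they do not overlap. */
static int bbox_intersect(const bbox_t *a, const bbox_t *b, bbox_t *out)
{
    out->ndims = a->ndims;
    for (int d = 0; d < a->ndims; d++) {
        out->lb[d] = lmax(a->lb[d], b->lb[d]);
        out->ub[d] = lmin(a->ub[d], b->ub[d]);
        if (out->lb[d] > out->ub[d])
            return 0;                  /* empty intersection */
    }
    return 1;
}

/* Smallest bounding box enclosing the union of a and b. */
static void bbox_union_hull(const bbox_t *a, const bbox_t *b, bbox_t *out)
{
    out->ndims = a->ndims;
    for (int d = 0; d < a->ndims; d++) {
        out->lb[d] = lmin(a->lb[d], b->lb[d]);
        out->ub[d] = lmax(a->ub[d], b->ub[d]);
    }
}
```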

Architectural support for rectangular transfers

  • Architectural support for rectangular transfers on the GPU
  • Support from programming models such as OpenCL and CUDA, e.g. clEnqueueReadBufferRect() and clEnqueueWriteBufferRect() (an example follows below)
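The call below sketches how one rectangular sub-region of a 2D double array can be read back from a GPU buffer in a single OpenCL operation. clEnqueueReadBufferRect() is the real OpenCL API (OpenCL 1.1 and later); the queue, buffer and geometry parameters are illustrative placeholders.

```c
/* Sketch: copying one rectangular sub-region of a 2D double array from a
 * GPU buffer to the host with a single OpenCL rectangular transfer. */
#include <CL/cl.h>

cl_int copy_rect_to_host(cl_command_queue queue, cl_mem dev_buf,
                         double *host_buf, size_t ncols,
                         size_t row0, size_t col0,
                         size_t nrows, size_t box_cols)
{
    /* Origins and region: the x component is in bytes, y/z in rows/slices. */
    size_t origin[3] = { col0 * sizeof(double), row0, 0 };
    size_t region[3] = { box_cols * sizeof(double), nrows, 1 };
    size_t row_pitch = ncols * sizeof(double);   /* full row length in bytes */

    return clEnqueueReadBufferRect(queue, dev_buf, CL_TRUE,
                                   origin,        /* origin in the GPU buffer */
                                   origin,        /* same origin on the host  */
                                   region,
                                   row_pitch, 0,  /* buffer row/slice pitch   */
                                   row_pitch, 0,  /* host row/slice pitch     */
                                   host_buf, 0, NULL, NULL);
}
```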

The Bounding Box based memory manager (BBMM)

  • Compiler-assisted runtime scheme
  • The compile-time component uses static analysis to identify the regions of data accessed by a loop nest in terms of bounding boxes
  • The runtime refines these initial bounding boxes into a set of disjoint bounding boxes
  • All data transfers are done in terms of bounding boxes

Overview of BBMM

Data allocation scheme

Buffer Management

  • Two lists per GPU:
    ○ inuse list
    ○ unused list
  • Each bounding box has an associated usage count
  • Flags to indicate read-only/read-write, etc. (a bookkeeping sketch follows below)
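The bookkeeping described above might be organized along the lines of the sketch below. All names (gpu_box_t, gpu_buffers_t, box_out(), ...) are assumptions made for illustration and do not reflect BBMM's actual implementation; box_out() also previews the box-in/box-out feature on the next slide by evicting entries from the unused list when device memory runs short.

```c
/* Sketch of per-GPU buffer-manager bookkeeping: an inuse list and an unused
 * list of resident bounding boxes, each with a usage count and access flags.
 * All names here are illustrative assumptions, not BBMM's real structures. */
#include <stddef.h>
#include <stdlib.h>

enum { BB_READ_ONLY = 1, BB_READ_WRITE = 2 };

typedef struct gpu_box {
    long lb[3], ub[3];        /* bounding box extents                */
    void *dev_mem;            /* handle to the device allocation     */
    size_t bytes;             /* size of the allocation              */
    int usage_count;          /* tiles currently using this box      */
    int flags;                /* BB_READ_ONLY / BB_READ_WRITE        */
    struct gpu_box *next;
} gpu_box_t;

typedef struct {
    gpu_box_t *inuse;         /* boxes referenced by in-flight tiles */
    gpu_box_t *unused;        /* resident boxes with usage_count 0   */
    size_t free_bytes;        /* remaining device memory             */
} gpu_buffers_t;

/* Box-out: free unused boxes until 'needed' bytes are available (or the
 * unused list is exhausted). Write-back of dirty boxes is omitted here. */
static int box_out(gpu_buffers_t *g, size_t needed)
{
    while (g->free_bytes < needed && g->unused) {
        gpu_box_t *victim = g->unused;
        g->unused = victim->next;
        /* releasing device memory would go here, e.g. clReleaseMemObject() */
        g->free_bytes += victim->bytes;
        free(victim);
    }
    return g->free_bytes >= needed;
}
```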

Important features of the Buffer Manager

  • Inter-tile data reuse
    ○ Reuse data already present on the GPU
  • Box-in/box-out
    ○ Ability to make space on the GPU when it runs out of memory

Inter-GPU coherency

  • Based on our previous work: Roshan Dathathri, Chandan Reddy, Thejas Ramashekar, and Uday Bondhugula. Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed Memory. In ACM PACT 2013.
  • Identify the data to be communicated from a source tile due to flow (RAW) dependences, called the Flow-out set
  • Further refine the Flow-out set using a technique called source-distinct partitioning
  • Eliminates both unnecessary and duplicate data transfers
  • The scheme has been demonstrated to work well on both distributed-memory and heterogeneous systems

Inter-GPU coherency (cont.)

[Figure: example with N=8, k=1; Tile1 is executed on GPU1 and Tile2 on GPU2; the CPU holds a copy of the communication set for k=1, covering the data for Tile1 and Tile2 in iteration k=1.]

  • BBMM extracts the flow-out sets as flow-out bounding boxes
  • The flow-out bounding box of a tile is copied out from the source GPU onto the host CPU
  • If any other GPU contains the same bounding box, it is updated with a flow-in transfer
  • If no GPU currently has that bounding box, the updated data is retained on the CPU (a host-side sketch follows below)
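A host-side view of this coherency step might look like the following sketch, in which plain host arrays stand in for GPU-resident copies and memcpy() stands in for the rectangular device transfers. All names (gpu_state_t, coherency_step(), ...) are illustrative assumptions, not BBMM's API.

```c
/* Sketch of the coherency step after a serial iteration: copy each tile's
 * flow-out bounding box to the host, then push it to every other GPU that
 * holds the same box (flow-in); otherwise the host copy is simply retained.
 * Host arrays stand in for device memory; all names are illustrative. */
#include <string.h>

#define NGPUS 2
#define NBOXES 4          /* flow-out boxes produced in this iteration */
#define BOX_BYTES 4096    /* pretend every box has the same size       */

typedef struct {
    int resident[NBOXES];             /* does this GPU hold box b?     */
    char mem[NBOXES][BOX_BYTES];      /* stand-in for GPU memory       */
} gpu_state_t;

static char host_copy[NBOXES][BOX_BYTES];   /* CPU's copy of each box  */

void coherency_step(gpu_state_t gpus[NGPUS], int owner_of[NBOXES])
{
    for (int b = 0; b < NBOXES; b++) {
        int src = owner_of[b];
        /* Flow-out: copy the box from the source GPU onto the host
         * (clEnqueueReadBufferRect() in the real OpenCL setting). */
        memcpy(host_copy[b], gpus[src].mem[b], BOX_BYTES);

        /* Flow-in: update every other GPU that holds the same box
         * (clEnqueueWriteBufferRect() in the real setting). */
        for (int g = 0; g < NGPUS; g++)
            if (g != src && gpus[g].resident[b])
                memcpy(gpus[g].mem[b], host_copy[b], BOX_BYTES);
        /* If no other GPU holds the box, the host copy is retained. */
    }
}
```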

Implementation

  • The compile-time component is integrated into the polyhedral source-to-source transformer Pluto
  • The input to the compile-time component is the sequential C code containing a set of affine loop nests
  • Pluto creates a tiled and parallelized version of the input code
  • BBMM's compile-time component takes this tiled and parallelized code as input and generates:
    ○ A set of initial and flow-out bounding boxes
    ○ Code similar to the host code structure shown earlier
  • The runtime component is implemented as a stand-alone C library

Evaluation and Results

  • Setup
    ○ A multi-GPU machine consisting of 3 NVIDIA Tesla C2050 (Fermi) GPUs and 1 NVIDIA Tesla K20 (Kepler) GPU, with 2.5 GB of memory each
    ○ A 12-core CPU system as the host
  • Benchmarks

Evaluation Parameters

  • Overhead of the runtime library
  • Comparison of data allocation sizes
  • Performance with data scaling
  • Comparison with manually written code
  • Performance with box-in/box-out
  • Benefits of inter-tile data reuse
  • Performance with access function split

Overhead of runtime library

  total_execution_time = memory_mgmt_time + compute_time + flowout_time + flowin_time + writeout_time
  overhead_percentage = (memory_mgmt_time / total_execution_time) * 100

  • For all programs, the runtime overhead was less than 0.1% of the total execution time of the program (hence insignificant)

Comparison of data allocation sizes

  • Up to 75% reduction on a 4-GPU machine compared to the convex bounding box scheme
  • Equal to the exact data sizes required (manually computed) in all cases

Performance with data scaling

  • Data scaling is similar to weak scaling, but with emphasis on data size (memory utilization) rather than on problem size (computation)
  • Hence we consider the per-iteration speedup
  • The per-iteration time includes all overhead: data allocation time, compute time, flow-out time, flow-in time and write-out time
  • BBMM affects all of the above except compute time
  • Mean speedup of 0.94, indicating near-ideal speedup

Comparison with manually written code

  • The manual code has the following optimizations:
    ○ Optimized to have the theoretical minimum data allocation sizes and coherency volume
    ○ Reuse exploitation was the theoretical maximum
  • BBMM is at least 88% as efficient as the manual OpenCL code
  • Outperforms the manual OpenACC code

Benefit of box-in/box-out

  • Significant performance improvements with tiles that have a sufficient compute-to-copy ratio
  • Without it, significant performance degradation
  • With the right tiling strategy, the feature can allow applications to work with data sizes significantly larger than the available GPU memory

Compute-Copy Overlap

  • Hide the data movement overhead within computation time
  • Split the computation allocated to a GPU into multiple tiles
  • Register a callback to be called at the completion of each tile
  • In the callback, perform the CopyOut() and CopyIn()
  • CopyIn() does not conflict because we work on a distributed parallel loop (a callback sketch follows below)

[Figure: timeline comparing execution without compute-copy overlap (one single large tile followed by copy-out/copy-in) against execution with overlap (TILE 1, TILE 2, TILE 3 with their copy-outs and copy-ins overlapped with kernel execution).]
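In OpenCL, the per-tile completion callback described above can be registered with clSetEventCallback() on the event returned by the tile's kernel launch. The sketch below shows the shape of such a registration under assumed names (tile_info_t, launch_tile(), and stubbed CopyOut()/CopyIn()); it is not the code generated by BBMM.

```c
/* Sketch: overlapping copies with computation by splitting a GPU's work
 * into several tiles and registering a completion callback per tile.
 * CopyOut()/CopyIn() are placeholders for the rectangular transfers of the
 * tile's flow-out/flow-in bounding boxes. */
#include <CL/cl.h>

typedef struct { int tile_id; /* ...bounding boxes, queues, ... */ } tile_info_t;

static void CopyOut(tile_info_t *t) { (void)t; /* enqueue rect write-backs */ }
static void CopyIn(tile_info_t *t)  { (void)t; /* enqueue rect updates     */ }

static void CL_CALLBACK tile_done(cl_event ev, cl_int status, void *user_data)
{
    tile_info_t *t = (tile_info_t *)user_data;
    (void)ev;                          /* the host releases the event later */
    if (status == CL_COMPLETE) {
        CopyOut(t);   /* move this tile's results while other tiles compute */
        CopyIn(t);    /* safe: we work on a distributed parallel loop       */
    }
}

/* Launch one tile's kernel and attach the callback to its completion event. */
cl_int launch_tile(cl_command_queue queue, cl_kernel kernel,
                   const size_t *gsize, tile_info_t *tile)
{
    cl_event ev;
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, gsize, NULL,
                                        0, NULL, &ev);
    if (err == CL_SUCCESS)
        err = clSetEventCallback(ev, CL_COMPLETE, tile_done, tile);
    return err;
}
```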

Maximizing Compute-Copy Overlap

[Figure: timelines with and without tile reordering; with reordering, the copy-out from TILE 2 overlaps with the execution of the remaining tiles.]

  • Sort the tiles based on the size of their CopyOut data
  • Schedule them in the sorted order, largest copy-out size first (a sorting sketch follows below)
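The reordering heuristic amounts to sorting the tiles by the size of their copy-out data in descending order. A minimal sketch using the C standard library qsort() follows; tile_t and copyout_bytes are assumed names for illustration.

```c
/* Sketch: schedule tiles largest-copy-out-first so the biggest transfer
 * starts as early as possible and overlaps with the remaining tiles. */
#include <stdlib.h>

typedef struct {
    int id;
    size_t copyout_bytes;   /* size of the tile's flow-out data */
} tile_t;

static int by_copyout_desc(const void *a, const void *b)
{
    const tile_t *ta = a, *tb = b;
    if (ta->copyout_bytes < tb->copyout_bytes) return 1;
    if (ta->copyout_bytes > tb->copyout_bytes) return -1;
    return 0;
}

void order_tiles(tile_t *tiles, size_t ntiles)
{
    /* Largest copy-out size first; tiles are then launched in this order. */
    qsort(tiles, ntiles, sizeof(tile_t), by_copyout_desc);
}
```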

Related Work


Conclusion

  • We presented a fully automatic data allocation and memory management framework for affine loop nests on multi-GPU machines
  • Data allocation, buffer management and inter-GPU coherency were all done at the granularity of bounding boxes
  • On a 4-GPU machine our scheme was able to:
    ○ Achieve allocation size reductions of 75% compared to existing schemes
    ○ Comparison with manual OpenCL and OpenACC code showed:
      ■ Our code yielded a performance of at least 88% of the manual OpenCL code
      ■ It outperformed the OpenACC code in all cases
    ○ Achieve excellent data scaling
  • All of the above was achieved with an insignificant runtime overhead of 0.1%
  • Our work is suited to any compiler/runtime system targeting GPUs
  • It can bridge the data allocation gap that exists in programming these systems

Publications based on this work

1. Automatic Data Allocation and Buffer Management for Multi-GPU Machines Thejas Ramashekar, Uday Bondhugula, In the ACM Transactions on Architecture and Code Optimization, Vol. 10, No. 4, Article 60, Publication date: December 2013 . Selected for presentation at HiPEAC '14, Jan 2014, Vienna, Austria. 2. Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri, Chandan Reddy, Thejas Ramashekar, and Uday Bondhugula, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2013.

slide-79
SLIDE 79

References

Backup Slides

Data Allocation Scheme Algorithms

Structure of the generated host code

Structure of the generated kernel code

Performance with inter-tile reuse

  • Compared to the performance of the same code without reuse
  • Mean speedup of 5.4x, with a maximum speedup of up to 85x

Performance with access function split

  • Compared to the performance of code without splits
  • The stencils did not undergo performance degradation
  • floyd, in the worst case, suffered a 40% performance loss, but was still much better than CPU execution times

Table of contents

  • HPC Setup
  • Multi-GPU Machines
  • Running a program on multi-GPU machines
  • Role of data allocation and memory management
  • Need for an automatic memory manager
  • Design goals
  • Bounding boxes
  • Overview of BBMM
  • Data allocation scheme
  • Buffer Management
  • Inter-GPU coherency
  • Structure of the generated code
  • Experimental setup
  • Evaluation and Results
  • Related Work
  • Conclusion and Future work

Distributed memory paradigm

[Figure: a CPU node with DDR RAM alongside GPU1 through GPU N; each GPU has its own global memory, local memories and processing elements (PEs).]