SLIDE 1

Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters

Jonathan Lifflander, G. Carl Evans, Anshu Arya, Laxmikant Kale

University of Illinois Urbana-Champaign

May 7, 2012

SLIDE 2

◮ Work is overdecomposed in objects
  ◮ Fine-grain task parallelism
  ◮ Ideal for the CPU
  ◮ Overlap of communication and computation
◮ GPUs rely on massive data-parallelism
  ◮ Fine grains decrease performance
  ◮ Each kernel instantiation has substantial overhead
◮ To reduce overhead
  ◮ Combine fine-grain work units for the GPU
  ◮ The delay may be insignificant if the work is low priority
SLIDE 3

Terminology

◮ Agglomeration—composition of distinct work units
◮ Static agglomeration—a fixed number of work units is agglomerated
◮ Dynamic agglomeration—the number of work units agglomerated varies at runtime
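The distinction can be sketched in host-side C++. This is an illustrative sketch, not the runtime's actual API: static agglomeration always packs a fixed, pre-chosen packet size, while dynamic agglomeration takes whatever has accumulated when the scheduler decides to flush.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Static agglomeration: always combine a fixed number of work units
// (the "packet size"), chosen before the run.
std::vector<std::vector<int>> agglomerateStatic(const std::vector<int>& units,
                                                std::size_t packetSize) {
    std::vector<std::vector<int>> batches;
    for (std::size_t i = 0; i < units.size(); i += packetSize) {
        std::size_t end = std::min(i + packetSize, units.size());
        batches.emplace_back(units.begin() + i, units.begin() + end);
    }
    return batches;
}

// Dynamic agglomeration: combine whatever has accumulated when the
// scheduler decides to flush, so batch sizes vary at runtime.
std::vector<int> agglomerateDynamic(std::vector<int>& fifo) {
    std::vector<int> batch;
    batch.swap(fifo);  // take every queued unit, leaving the FIFO empty
    return batch;
}
```

With five units and a packet size of two, the static version always yields batches of sizes 2, 2, 1; the dynamic version yields one batch whose size depends on how much work had queued up.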

SLIDE 4

[Diagram: the scheduler draws from a work unit pool and dispatches to CPUs directly and to accelerators through an accelerator FIFO]

SLIDE 5

[Diagram: scheduleWork() feeds work units into the accelerator FIFO; agglomerateWork() combines them before dispatch to the accelerator]

SLIDE 6

Programmer/Runtime Division

◮ Programmer
  ◮ Writes the GPU kernel for agglomeration
  ◮ Creates an offset array
    ◮ Each task’s input might be a different size
    ◮ Stores the offset of each task’s beginning and ending index in the contiguous data arrays
◮ System
  ◮ Decides what work to execute and when
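A minimal host-side sketch of the offset-array step, assuming each task carries a variable-length float input. The `Task` struct and `buildOffsets` name are ours, not the runtime's:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical task type: each task carries a variable-length input.
struct Task { std::vector<float> input; };

// Pack the tasks' inputs into one contiguous buffer and record each
// task's beginning and ending offset into it, as the programmer must do
// for the agglomerated kernel (names here are illustrative).
void buildOffsets(const std::vector<Task>& tasks,
                  std::vector<float>& packed,
                  std::vector<std::size_t>& begin,
                  std::vector<std::size_t>& end) {
    packed.clear(); begin.clear(); end.clear();
    for (const Task& t : tasks) {
        begin.push_back(packed.size());   // start offset of this task
        packed.insert(packed.end(), t.input.begin(), t.input.end());
        end.push_back(packed.size());     // one-past-end offset of this task
    }
}
```

The `begin`/`end` arrays are what lets a single agglomerated kernel launch find each original task's slice inside the contiguous buffer.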

SLIDE 7

[Diagram: message queue ordering, with the low-priority agglomeration message behind the higher-priority GPU message and the application's messages]

SLIDE 8

Dynamic Agglomeration

◮ Uses the following heuristic
  ◮ If the “accelerator FIFO” reaches a size limit, work is agglomerated
    ◮ The limit is typically set based on memory limitations
  ◮ Otherwise, a low-priority message that triggers agglomeration is enqueued
    ◮ While higher-priority work is being generated, it goes into the FIFO
    ◮ When the higher-priority work lets up, the queued work is agglomerated
◮ Since the work is assumed to be low priority, not agglomerating aggressively should not impact performance
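The heuristic can be sketched as follows. The class and method names are illustrative stand-ins for the runtime's scheduler, and the low-priority agglomeration message is modeled as a callback the scheduler invokes once no higher-priority work remains:

```cpp
#include <cstddef>
#include <deque>
#include <vector>

struct AcceleratorFifo {
    std::deque<int> fifo;   // queued fine-grain work units
    std::size_t limit = 8;  // size limit, typically set by GPU memory

    // Returns an agglomerated batch when the size limit is hit, else
    // empty (in the real system this is also where the low-priority
    // agglomeration message would be enqueued).
    std::vector<int> enqueue(int unit) {
        fifo.push_back(unit);
        if (fifo.size() >= limit) return flush();
        return {};
    }

    // Invoked when the low-priority message finally runs, i.e. when
    // higher-priority work has let up.
    std::vector<int> onLowPriorityMessage() { return flush(); }

    std::vector<int> flush() {
        std::vector<int> batch(fifo.begin(), fifo.end());
        fifo.clear();
        return batch;
    }
};
```

With `limit = 3`, the first two enqueues return nothing; the third returns all three units at once; a later `onLowPriorityMessage()` drains whatever has accumulated since.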

SLIDE 9

[Diagram: non-agglomerated data (Input A, Input B, Output) versus agglomerated data (Input A', Input B', Output') with offset arrays Offset A and Offset B]

SLIDE 10

Case study: Molecular2D

SLIDE 11

Molecular2D

◮ Cells
  ◮ Execute on the CPU
◮ Interactions
  ◮ Execute on the GPU

SLIDE 12

Molecular 2D Interaction Kernel

__global__ void interact(...) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // For loop added for agglomeration: thread i walks its
    // agglomerated range [start[i], end[i])
    for (int j = start[i]; j < end[i]; j++) {
        // interaction work
    }
}

SLIDE 13

[Plot: execution time (seconds, 20 to 140) vs. number of particles (10,000 to 100,000) for CPU only, GPU without agglomeration, and GPU with agglomeration]

SLIDE 14

[Plot: speedup of agglomeration (1.00 to 1.18) vs. number of particles (10,000 to 100,000)]

SLIDE 15

[Plot: execution time (seconds, 85 to 120) vs. number of particles per work unit (500 to 2,500) for GPU without agglomeration and GPU with agglomeration]

SLIDE 16

[Plot: execution time (seconds, 3.8 to 5.2) vs. static agglomeration packet size (5 to 30), comparing dynamic scheduled agglomeration with static agglomeration]

SLIDE 17

Case study: LU Factorization without pivoting

SLIDE 18

[Diagram: a matrix partitioned into blocks A1,1, A1,2, A2,1, A2,2]

SLIDE 19

LU Factorization

◮ CPU
  ◮ Diagonal block factorization
  ◮ Triangular solves
◮ GPU
  ◮ Matrix-matrix multiplies
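The division of labor can be seen in a runnable sketch, under the simplifying assumption that each "block" is a single scalar, so the triangular solves and matrix-matrix multiplies collapse to scalar operations. The comments mark which phase of right-looking LU without pivoting stays on the CPU and which is offloaded (and agglomerated) on the GPU in this scheme:

```cpp
#include <vector>

// Scalar stand-in for right-looking block LU without pivoting.
// In-place: on return, A holds U on and above the diagonal and the
// strictly lower part of L below it.
void blockLU(std::vector<std::vector<double>>& A) {
    const int n = static_cast<int>(A.size());
    for (int k = 0; k < n; ++k) {
        // Diagonal factorization + triangular solves: CPU work.
        for (int i = k + 1; i < n; ++i)
            A[i][k] /= A[k][k];
        // Trailing updates (the matrix-matrix multiplies): GPU work,
        // one fine-grain unit per (i, j) block, so a natural target
        // for agglomeration.
        for (int i = k + 1; i < n; ++i)
            for (int j = k + 1; j < n; ++j)
                A[i][j] -= A[i][k] * A[k][j];
    }
}
```

The trailing update loop produces many independent fine-grain units per step, which is exactly the kind of work the runtime batches for the GPU.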

SLIDE 20

[Plot: execution time (seconds, 10 to 40) vs. matrix size (4096 to 10240) for CPU, GPU without agglomeration, and GPU with agglomeration]

SLIDE 21

[Plot: execution time (seconds, 36 to 50) vs. static packet size (20 to 120), comparing dynamic agglomeration with static agglomeration]

SLIDE 22

Conclusion

◮ For both benchmarks, agglomerating work increases performance
◮ Agglomeration does not need to be application-specific
◮ Statically selecting the number of work units to agglomerate is difficult and may reduce performance
◮ Runtimes can agglomerate automatically
  ◮ An agglomerating kernel must still be written by the programmer
  ◮ Dynamic agglomeration obtains better performance than static