SLIDE 1

Optimizing Explicit Data Transfers for Data Parallel Applications on Heterogeneous Multi-core Platforms

S. Saidi (1,2), P. Tendulkar (1), T. Lepley (2), O. Maler (1)

(1) Verimag   (2) STMicroelectronics

HiPEAC 2012

SLIDE 2

Outline

1. Introduction
2. Optimal Granularity (One Processor, Multiple Processors)
3. Shared Data Transfers
4. Experiments on the Cell Architecture


SLIDE 4

Introduction

Motivation

How to reduce/hide the off-chip memory latency?

[Figure: a host CPU and a multi-core fabric (PE0 ... PEn, each with a local memory) connected through an interconnect to the off-chip memory]

SLIDES 5-6

Introduction

Heterogeneous Multi-core Architectures

A powerful host processor and a multi-core fabric to accelerate computationally heavy kernels.

[Figure: the same platform, with tasks T0, T1, T2 mapped onto the processing elements of the multi-core fabric]

SLIDE 7

Introduction

Data Transfers

Offloadable kernels work on large data sets, initially stored in the off-chip memory.

    for i = 0 to n − 1:
        Y[i] = f(X[i])

[Figure: a task T0 on PE0 applies f to the input array X and produces the output array Y, both stored in off-chip memory]

SLIDE 8

Introduction

Data Transfers

High off-chip memory latency: accessing off-chip data is very costly.

[Figure: every iteration reads X[i] from and writes Y[i] to the off-chip memory, so each element access crosses the interconnect]

SLIDE 9

Introduction

Data Transfers

Data is transferred to a closer but smaller on-chip memory, using DMAs (Direct Memory Access).

[Figure: the input array is split into blocks (block 0, block 1, ...) that are moved between the off-chip memory and the on-chip local memories by DMA block transfers]

SLIDES 10-16

Introduction

DMA Data Transfers

s: number of array elements in one block.

[Figure: the array X (X[0] ... X[n − 1]) partitioned into blocks block_0 ... block_{m−1} of s elements each]

    i = 0
    while (i < n/s):
        Fetch(block_i)         // dma_get(local_buffer, block_i, s)
        Compute(block_i)
        Write back(block_i)    // dma_put(block_i, local_buffer, s)
        i++

Sequential execution of computations and data transfers.
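As a concrete illustration, here is a minimal C sketch of this single-buffered loop. The dma_get/dma_put helpers are hypothetical blocking primitives standing in for a platform's real DMA API (stubbed here with memcpy so the sketch is self-contained); MAX_BLOCK, f, and the assumption that n is a multiple of s are likewise illustrative, not taken from the slides.

    #include <stddef.h>
    #include <string.h>

    #define MAX_BLOCK 4096                      /* local-store buffer capacity (assumption) */

    /* Hypothetical blocking DMA primitives; a plain memcpy stands in for the engine. */
    static void dma_get(void *local_dst, const void *offchip_src, size_t bytes)
    { memcpy(local_dst, offchip_src, bytes); }
    static void dma_put(void *offchip_dst, const void *local_src, size_t bytes)
    { memcpy(offchip_dst, local_src, bytes); }

    static float f(float x) { return 2.0f * x; } /* placeholder per-element kernel */

    /* Single-buffered processing of X[0..n-1] into Y[0..n-1], s elements per block:
     * fetch, compute and write back run strictly one after the other (no overlap).
     * Assumes n is a multiple of s and s <= MAX_BLOCK. */
    void process_blocks(const float *X, float *Y, size_t n, size_t s)
    {
        static float in[MAX_BLOCK], out[MAX_BLOCK];   /* local (on-chip) buffers */
        for (size_t i = 0; i < n / s; i++) {
            dma_get(in, &X[i * s], s * sizeof(float));    /* Fetch(block_i)      */
            for (size_t j = 0; j < s; j++)                /* Compute(block_i)    */
                out[j] = f(in[j]);
            dma_put(&Y[i * s], out, s * sizeof(float));   /* Write back(block_i) */
        }
    }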

SLIDES 17-23

Introduction

Double Buffering

Asynchronous DMA calls and double buffering:

    Fetch(block_0)                               // dma_get(local_buffer[1], block_0, s)
    i = 0
    while (i < (n/s) − 1):
        Fetch(block_{i+1})                       // dma_get(local_buffer[2], block_{i+1}, s)
        Compute(block_i)
        Write back(block_i)
        i++
    Compute(block_{(n/s)−1})
    Write back(block_{(n/s)−1})

(the two local buffers alternate between iterations)

Overlap of computations and data transfers. What is the choice s∗ that optimizes performance?
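A corresponding C sketch of the double-buffered loop, this time with hypothetical asynchronous, tag-based primitives (dma_get_async, dma_put_async, dma_wait). On the Cell SPE the MFC get/put commands and tag-group waits play this role, but the exact calls and the buffer bookkeeping below are assumptions, not the authors' code.

    #include <stddef.h>

    #define MAX_BLOCK 4096          /* local-store buffer capacity (assumption) */

    /* Hypothetical asynchronous DMA primitives: the calls return immediately and
     * dma_wait(tag) blocks until all transfers issued with that tag have completed. */
    void dma_get_async(void *local_dst, const void *offchip_src, size_t bytes, int tag);
    void dma_put_async(void *offchip_dst, const void *local_src, size_t bytes, int tag);
    void dma_wait(int tag);

    float f(float x);               /* per-element kernel */

    /* Double buffering: while block i is computed in one buffer pair, block i+1
     * is fetched into the other. Assumes n is a multiple of s and s <= MAX_BLOCK. */
    void process_blocks_db(const float *X, float *Y, size_t n, size_t s)
    {
        static float in[2][MAX_BLOCK], out[2][MAX_BLOCK];
        size_t m = n / s;
        int cur = 0;

        dma_get_async(in[0], &X[0], s * sizeof(float), 0);        /* Fetch(block_0)     */
        for (size_t i = 0; i < m; i++) {
            int nxt = 1 - cur;
            if (i + 1 < m)                                         /* Fetch(block_{i+1}) */
                dma_get_async(in[nxt], &X[(i + 1) * s], s * sizeof(float), nxt);
            dma_wait(cur);       /* input block i is in; earlier put on this tag drained */
            for (size_t j = 0; j < s; j++)                         /* Compute(block_i)   */
                out[cur][j] = f(in[cur][j]);
            dma_put_async(&Y[i * s], out[cur], s * sizeof(float), cur); /* Write back    */
            cur = nxt;
        }
        dma_wait(0);                                               /* drain last write-backs */
        dma_wait(1);
    }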

SLIDE 24

Introduction

Our Contribution

1. We derive the optimal granularity for DMA transfers for:
   - one processor,
   - multiple processors.
2. We compare different strategies for transferring shared data.


SLIDES 27-31

Optimal Granularity: One Processor

Double Buffering Pipelined Execution

Overlap of:
- computation of the current block,
- transfer of the next block.

[Figure: Gantt chart of the pipelined execution of blocks b0 ... b4 across the input-transfer, computation and output-transfer stages; after the prologue (fetch of b0) the stages overlap, processor idle time is marked, and an epilogue drains the last block]

SLIDES 32-38

Optimal Granularity: One Processor

Computation and Transfer Time of a Data Block

s: number of array elements clustered in one block.

Computation time C(s):
- ω: time to compute one element,
- C(s) = ω · s

DMA transfer time T(s):
- I: fixed DMA initialization cost,
- α: transfer cost per byte,
- b: size of one array element,
- T(s) = I + α · b · s

SLIDE 39

Optimal Granularity: One Processor

Computation and Transfer Ratio

[Figure, left: transfer and computation time per block as a function of block size s; T(s) starts at I, C(s) at 0, and the two lines cross at s∗, separating the transfer domain (small s) from the computation domain (large s). Right: pipelined schedules for s = 1 and s = 3 over blocks b0 ... b8, illustrating the transfer regime and the computation regime, with prologue, epilogue and processor idle time marked]

SLIDES 40-41

Optimal Granularity: One Processor

Optimal Granularity

Optimal granularity s∗:

    C(s∗) = T(s∗)

[Figure: transfer/computation time per block vs. s; C(s) and T(s) intersect at s∗, with the transfer regime below s∗ and the computation regime above]

Pipeline execution time τ(s) using double buffering:

    τ(s) = (n/s + 1) · T(s)      (transfer regime)
    τ(s) = 2 · T(s) + n · ω      (computation regime)

[Figure: program execution time τ(s) as a function of s, with the transfer regime for s < s∗ and the computation regime for s > s∗]
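To make the crossover concrete, a small self-contained sketch that evaluates s∗ and τ(s) directly from these formulas. Solving C(s∗) = T(s∗) gives s∗ = I / (ω − α·b), which only exists when computing an element takes longer than transferring it (ω > α·b). The numeric values in main are placeholders for illustration (I and α happen to match the Cell measurements reported later in the deck; ω and b do not come from the slides).

    #include <math.h>
    #include <stdio.h>

    /* Model from the slides: C(s) = w*s, T(s) = I + a*b*s
     * w : cycles to compute one element (omega)
     * I : fixed DMA initialization cost
     * a : transfer cost per byte (alpha)
     * b : size of one array element in bytes */
    typedef struct { double w, I, a, b; } model;

    static double C(model m, double s) { return m.w * s; }
    static double T(model m, double s) { return m.I + m.a * m.b * s; }

    /* Optimal granularity: C(s*) = T(s*)  =>  s* = I / (w - a*b),
     * defined only when computing an element is slower than transferring it. */
    static double s_star(model m)
    {
        double denom = m.w - m.a * m.b;
        return denom > 0 ? m.I / denom : INFINITY;   /* otherwise always transfer-bound */
    }

    /* Double-buffered execution time over n elements (slide 41). */
    static double tau(model m, double s, double n)
    {
        if (T(m, s) >= C(m, s))                      /* transfer regime    */
            return (n / s + 1.0) * T(m, s);
        return 2.0 * T(m, s) + n * m.w;              /* computation regime */
    }

    int main(void)
    {
        model m = { .w = 2.0, .I = 400.0, .a = 0.22, .b = 4.0 };  /* illustrative values */
        double s = s_star(m);
        printf("s* = %.1f elements, tau(s*) = %.0f cycles (n = 1e6)\n",
               s, tau(m, s, 1e6));
        return 0;
    }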


SLIDE 43

Optimal Granularity: Multiple Processors

Multiple Processors

Partitioning: the array is split into p contiguous chunks of n/p elements, one per processor.

[Figure: the array divided into four contiguous chunks of n/p elements, assigned to processors P1, P2, P3, P4]

Assumptions:
- Processors are identical: same local store capacity, same double-buffering granularity, etc.
- A separate DMA engine per processor.
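Under these assumptions each processor can derive its own contiguous chunk from its rank; a trivial C sketch (assuming, as on the slide, that n is divisible by p):

    #include <stddef.h>

    typedef struct { size_t begin, end; } chunk;   /* half-open range [begin, end) */

    /* Processor r of p (0 <= r < p) owns elements [r*n/p, (r+1)*n/p) and runs
     * its own double-buffered loop over that range. */
    chunk my_chunk(size_t n, unsigned p, unsigned r)
    {
        size_t len = n / p;                        /* slide assumes n divisible by p */
        return (chunk){ r * len, (r + 1) * len };
    }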

SLIDES 44-49

Optimal Granularity: Multiple Processors

Multiple Processors

Pipelined execution for several processors:

[Figure: Gantt chart for p = 3 processors P0, P1, P2; blocks b0, b1, b2 are fetched during the prologue, then each processor computes its block while the next group b3, b4, b5 (and later b6, b7, b8) is transferred, with input and output transfers, processor idle time, and an epilogue]

Computation time of one block: C(s) = ω · s

DMA transfer time of one block, given p processors: T(s, p) = I + α(p) · b · s

α(p): transfer cost per byte given the contention of p concurrent transfer requests.

SLIDES 50-53

Optimal Granularity: Multiple Processors

Multiple Processors: Optimal Granularity

Optimal granularity s∗_p:

    T(s∗_p, p) = C(s∗_p)

[Figure: T(s, 1), T(s, 2) and T(s, 3) plotted against C(s); as contention grows with p the transfer line gets steeper, so the crossing point moves right and s∗_1 < s∗_2 < s∗_3]

Optimal granularity increases with the number of processors.
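The same calculation extended to p processors makes the trend visible: with the contention model α(p) ≈ p · α(1) (the relation measured on the Cell later in the deck), the transfer line steepens with p and s∗_p grows. ω and b below are illustrative placeholders; for large enough p the denominator turns negative, i.e. the execution is transfer-bound at every granularity.

    #include <stdio.h>

    /* s*_p solves T(s,p) = C(s) with T(s,p) = I + alpha(p)*b*s and C(s) = w*s,
     * i.e. s*_p = I / (w - alpha(p)*b). Contention model: alpha(p) = p * alpha1. */
    int main(void)
    {
        double w = 2.0, I = 400.0, alpha1 = 0.22, b = 4.0;   /* w, b: placeholders */
        for (int p = 1; p <= 8; p *= 2) {
            double denom = w - p * alpha1 * b;
            if (denom > 0)
                printf("p = %d: s*_p = %.0f elements\n", p, I / denom);
            else
                printf("p = %d: transfer-bound at every block size\n", p);
        }
        return 0;
    }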


SLIDES 55-56

Shared Data Transfers

Applications with shared data

Data parallel loop with shared input data:

    for i := 0 to n − 1 do
        Y[i] := f(X[i], V[i])
    where V[i] = {X[i − 1], X[i − 2], ..., X[i − k]}

Neighboring blocks share data:

[Figure: the array of n elements split into blocks b0 ... b5 of s elements; each block also needs the last k elements of its left neighbor, the neighboring shared data]
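To make the sharing pattern concrete, a small C sketch of such a loop, with a k-tap moving average standing in for f (the real kernel is whatever function of X[i] and its k predecessors the application computes):

    /* Each output needs X[i] plus the k preceding inputs, so a block of s elements
     * also needs the last k elements of the previous block: that overlap is the
     * neighboring shared data shown on the slide. */
    void shared_window_kernel(const float *X, float *Y, int n, int k)
    {
        for (int i = 0; i < n; i++) {
            float acc = X[i];
            for (int j = 1; j <= k && i - j >= 0; j++)   /* V[i] = {X[i-1], ..., X[i-k]} */
                acc += X[i - j];
            Y[i] = acc / (float)(k + 1);                 /* hypothetical f: moving average */
        }
    }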

SLIDE 57

Shared Data Transfers

Strategies for transferring shared data

1. Replication
2. Inter-processor communication
3. Local buffering

SLIDES 59-60

Shared Data Transfers

1. Replication

[Figure: neighboring blocks b2 and b3 are assigned to processors P0 and P1 in the multi-core fabric; each processor fetches its block, together with the neighboring shared data, from off-chip memory with its own DMA, so the shared region is transferred twice]
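A transfer-level C sketch of replication, using the same hypothetical dma_get/dma_put as before; compute_block is a stand-in for the windowed kernel, and the buffer layout, the clamp at the array start and the contiguous chunk per processor are assumptions for illustration:

    #include <stddef.h>

    void dma_get(void *local_dst, const void *offchip_src, size_t bytes);   /* hypothetical */
    void dma_put(void *offchip_dst, const void *local_src, size_t bytes);   /* hypothetical */
    /* Hypothetical kernel: consumes `halo` border elements plus `s` new ones. */
    void compute_block(const float *in, float *out, size_t halo, size_t s, size_t k);

    #define MAX_BLOCK 4096

    /* Replication: processor r fetches each of its blocks together with the
     * k-element left border directly from off-chip memory, so that border is
     * also fetched by the processor that owns the previous block. */
    void replication(const float *X, float *Y, size_t n, size_t s, size_t k,
                     unsigned p, unsigned r)
    {
        float in[MAX_BLOCK], out[MAX_BLOCK];
        size_t begin = r * (n / p), end = (r + 1) * (n / p);
        for (size_t off = begin; off < end; off += s) {
            size_t lo   = off >= k ? off - k : 0;            /* clamp at array start */
            size_t halo = off - lo;
            dma_get(in, &X[lo], (halo + s) * sizeof(float)); /* shared data re-fetched */
            compute_block(in, out, halo, s, k);
            dma_put(&Y[off], out, s * sizeof(float));
        }
    }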


SLIDES 62-64

Shared Data Transfers

2. Inter-processor communication

Neighboring blocks are computed by neighboring processors.

[Figure: P0 and P1 hold the neighboring blocks b2 and b3; the shared region is forwarded from one processor's local memory to the other's by a DMA transfer over the NoC (inter-processor communication), instead of being fetched again from off-chip memory]


SLIDES 66-69

Shared Data Transfers

3. Local Buffering

Neighboring blocks are computed by the same processor.

[Figure: P0 fetches blocks b2 and then b3 by DMA; the data shared between them is already in P0's local memory, so it is copied locally into the next block's buffer instead of being transferred again from off-chip memory]
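A matching sketch of local buffering with the same hypothetical primitives: only the s new elements are fetched per block, and the k shared elements are moved within the local store (the memmove) rather than re-fetched. It assumes s >= k and k + s <= MAX_BLOCK; compute_block is the same hypothetical stand-in as above.

    #include <string.h>
    #include <stddef.h>

    void dma_get(void *local_dst, const void *offchip_src, size_t bytes);   /* hypothetical */
    void dma_put(void *offchip_dst, const void *local_src, size_t bytes);   /* hypothetical */
    void compute_block(const float *in, float *out, size_t halo, size_t s, size_t k);

    #define MAX_BLOCK 4096

    /* Local buffering: neighboring blocks stay on the same processor, so the
     * k shared elements are copied inside the local store instead of being
     * fetched again from off-chip memory. */
    void local_buffering(const float *X, float *Y, size_t n, size_t s, size_t k)
    {
        float in[MAX_BLOCK], out[MAX_BLOCK];       /* in = k shared + s new elements */
        size_t halo = 0;                           /* no shared data before block 0  */
        for (size_t off = 0; off < n; off += s) {
            dma_get(in + halo, &X[off], s * sizeof(float));      /* fetch new part only */
            compute_block(in, out, halo, s, k);
            dma_put(&Y[off], out, s * sizeof(float));
            memmove(in, in + halo + s - k, k * sizeof(float));   /* copy shared data locally */
            halo = k;
        }
    }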

SLIDES 70-72

Shared Data Transfers

Comparing Strategies

For each strategy:
1. we characterize the cost of transferring shared data,
2. we derive the optimal granularity,
3. we evaluate the overall execution time in the computation and transfer regimes.

SLIDES 73-75

Shared Data Transfers

Comparing Strategies

Based on a parametric study, we derive the optimal strategy for transferring shared data.

Replication: ⊖ contention overhead from concurrent processor transfers.

[Figure: Gantt chart of block transfers with replication on P0, P1, P2: input and output transfers of blocks b0 ... b8, with prologue, epilogue and processor idle time]

Local buffering and inter-processor communication: ⊖ processing overhead.

[Figure: the same schedule with inter-processor communication (IPC) phases inserted between block computations, again with prologue, epilogue and processor idle time]


SLIDE 77

Experiments on the Cell Architecture

Overview of the Cell B.E. Architecture

[Figure: block diagram of the Cell B.E.; eight SPEs (each SPU with its MMU and MFC) and the PPU with its MMU and L2 cache, connected by the Element Interconnect Bus (EIB) to the XDR DRAM interface, the I/O interface and the coherent interface]

Platform characteristics:
- 9-core heterogeneous multi-core architecture, with a Power Processor Element (PPE) and 8 Synergistic Processing Elements (SPEs).
- Limited local store capacity per SPE: 256 Kbytes.
- Explicitly managed memory system, using DMAs.

SLIDE 78

Experiments on the Cell Architecture

Measured DMA Latency

[Figure: measured DMA transfer time (clock cycles, log scale) vs. super-block size s · b from 16 to 16384 bytes, for 1, 2, 4 and 8 SPUs]

Profiled hardware parameters:
- DMA issue time: I ≈ 400 clock cycles
- Off-chip memory transfer cost per byte, 1 processor: α(1) ≈ 0.22 clock cycles
- Off-chip memory transfer cost per byte, p processors: α(p) ≈ p · α(1)
- Inter-processor communication transfer cost per byte: β ≈ 0.13 clock cycles

SLIDE 79

Experiments on the Cell Architecture

Optimal Granularity

[Figure: predicted and measured execution time (clock cycles) vs. super-block size s · b for 2 and 8 SPUs (2 SPU-pred, 8 SPU-pred, 2 SPU-meas, 8 SPU-meas)]

The predicted optimal granularities give good performance.

SLIDE 80

Experiments on the Cell Architecture

Shared Data Transfers

Shared data size is 1024 bytes.

[Figure: the blocked array with neighboring shared data (blocks b0 ... b5), and the measured execution time (clock cycles) vs. super-block size from 1024 to 8192 bytes for the three strategies (replication, inter-processor communication, local buffering) on 2 and 8 SPUs]

SLIDE 81

Experiments on the Cell Architecture

Conclusion

We presented a general methodology for automating decisions about:
1. the optimal granularity for data transfers,
2. the optimal strategy for transferring shared data.

We validated the approach with experiments on the Cell architecture.

On-going work and perspectives:
1. Extend the work to other platforms, such as P2012.
2. Extend the work to multidimensional data.
3. Consider variability in computation times.