SLIDE 1

JIT renaming and lazy write-back on the Cell/B.E.

Pieter Bellens, Josep M. Perez, Rosa M. Badia, Jesus Labarta

Barcelona Supercomputing Center (BSC-CNS) pieter.bellens@bsc.es

SLIDE 2

Overview

  • Cell Broadband Engine (Cell/B.E.)
  • Cell Superscalar (CellSs)
  • Bypassing
  • Motivation
  • Implementation
  • Results
  • Lazy write-back
  • Just-In-Time renaming
  • Current status and ongoing work
SLIDE 3

Cell Broadband Engine

(architecture diagram) 256 KB Local Store per SPE, asynchronous DMA transfers, 2 hardware threads on the PPE

SLIDE 4

CellSs

Runtime environment that automatically parallelizes sequential user applications for the Cell/B.E.

(diagram) The CellSs compiler transforms the annotated user application into a parallel Cell/B.E. application; the CellSs PPE runtime runs on the PPE and the CellSs SPE runtime on each of the SPEs.

SLIDE 5

CellSs: sample code (sparse LU)

    int main(int argc, char **argv)
    {
        int ii, jj, kk;
        …
        for (kk=0; kk<NB; kk++) {
            lu0(A[kk][kk]);
            for (jj=kk+1; jj<NB; jj++)
                if (A[kk][jj] != NULL)
                    fwd(A[kk][kk], A[kk][jj]);
            for (ii=kk+1; ii<NB; ii++)
                if (A[ii][kk] != NULL) {
                    bdiv(A[kk][kk], A[ii][kk]);
                    for (jj=kk+1; jj<NB; jj++)
                        if (A[kk][jj] != NULL) {
                            if (A[ii][jj] == NULL)
                                A[ii][jj] = allocate_clean_block();
                            bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
                        }
                }
        }
    }

    void lu0(float *diag);
    void bdiv(float *diag, float *row);
    void bmod(float *row, float *col, float *inner);
    void fwd(float *diag, float *col);

(figure: an NB×NB hypermatrix of B×B blocks)

SLIDE 6

CellSs: sample code (sparse LU)

    int main(int argc, char **argv)
    {
        int ii, jj, kk;
        …
        for (kk=0; kk<NB; kk++) {
            lu0(A[kk][kk]);
            for (jj=kk+1; jj<NB; jj++)
                if (A[kk][jj] != NULL)
                    fwd(A[kk][kk], A[kk][jj]);
            for (ii=kk+1; ii<NB; ii++)
                if (A[ii][kk] != NULL) {
                    bdiv(A[kk][kk], A[ii][kk]);
                    for (jj=kk+1; jj<NB; jj++)
                        if (A[kk][jj] != NULL) {
                            if (A[ii][jj] == NULL)
                                A[ii][jj] = allocate_clean_block();
                            bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
                        }
                }
        }
    }

    #pragma css task inout(diag[B][B])
    void lu0(float *diag);
    #pragma css task input(diag[B][B]) inout(row[B][B])
    void bdiv(float *diag, float *row);
    #pragma css task input(row[B][B],col[B][B]) inout(inner[B][B])
    void bmod(float *row, float *col, float *inner);
    #pragma css task input(diag[B][B]) inout(col[B][B])
    void fwd(float *diag, float *col);

(figure: an NB×NB hypermatrix of B×B blocks)

SLIDE 7

CellSs: compiler

(toolchain diagram) The annotated user application app.c passes through the CellSs compiler, which emits app_spe.c and app_ppe.c. The SPE compiler and SPE linker (SDK) build app_spe.o and the CellSs SPE library into an SPE executable, which the SPE embedder wraps as a PPE object. The PPE compiler produces app_ppe.o, and the PPE linker combines everything with llib_css-ppe.so and llib_css-spe.so into the final Cell executable: the parallel Cell/B.E. application.

SLIDE 8

CellSs: runtime libraries

(diagram) PPE memory holds the user data, the task dependence graph (TDG) and the renaming table.

PPE side (user main program, CellSs main thread and CellSs helper thread):

1) task creation
2) dependence analysis and data renaming
3) update TDG
4) scheduling
5) synchronisation with SPEs

SPE side (original task code):

6) stage in
7) execute
8) stage out and synchronisation

SLIDE 9

CellSs: runtime behaviour (matrix multiply)

  • Visualization of the runtime phases as a function of time using Paraver
  • Each phase is assigned a different colour
  • SPE task execution
  • SPE DMA wait
  • Thread idling
SLIDE 10

CellSs: runtime behaviour (matrix multiply)

SLIDE 11

Bypassing: motivation

A new architecture, but the song remains the same:

Improve performance

  • General computation pattern
  • PPE generates work for SPEs
  • SPEs repeatedly fetch work and perform computation
  • Traditional approach vs. bypassing approach
  • Cell/B.E. interconnect
  • Element Interconnect Bus (EIB)

Let's take a closer look at code executing on the Cell/B.E.:

SLIDE 12

Bypassing: motivation: general computation pattern

(diagram) Each task follows the pattern stage in → execute → stage out. Traditional approach: 1) SPE1 stages in from main memory, 2) SPE1 stages out, 3) SPE2 stages in, 4) SPE2 stages out. Every stage-in and stage-out is a main memory access.

SLIDE 13

Bypassing: motivation: Cell/B.E. interconnect

Element Interconnect Bus (EIB):

“Another class of bottlenecks is contention. For instance, if four SPEs are trying to move data to or from the MIC at the same time, their aggregate bandwidth of 102.4 GB/sec completely swamps the MIC's bandwidth of 25.6 GB/sec. Similarly, while the SPEs are trying to interact with the MIC, the PPE may have degraded access to main memory. When a unit is overwhelmed, it might need to retry commands, which in turn slows traffic down even further.”

David Krolak, “Unleashing the Cell Broadband Engine Processor: the Element Interconnect Bus”

SLIDE 14

Bypassing: motivation

How do contention and blocking influence the execution?

  • Countermeasures:
  • software cache in the LS of an SPE
  • double buffering
  • ???
SLIDE 15

Bypassing: motivation

  • General idea: transfer objects between the LSs of SPEs without going through main memory
  • Effect on PPE threads?
SLIDE 16

Bypassing: motivation

(diagram) Bypassing approach: 1) SPE1 stages in from main memory; 2) SPE2 bypasses the object directly from SPE1's LS, replacing a memory access and freeing up LS space on SPE1; (3) the stage out to main memory happens only if needed.

SLIDE 17

Bypassing: implementation

  • General solution
  • SPE runtime autonomously decides to go to main memory or to bypass from another SPE
  • No need to tailor the bypassing mechanism to a specific application
  • Implemented using the SPE's Atomic Cache Unit (ACU)
  • Location of software objects in the system is updated using the ACU
  • Distributed solution
  • Makes good use of hardware features
SLIDE 18

Bypassing: results: opportunities for bypassing

Are there opportunities to bypass data from one SPE to another?

SLIDE 19

Bypassing: results: reduction in wait time

Does the wait time effectively decrease when bypassing?

SLIDE 20

Lazy write-back: concept

    #pragma css task inout(a)
    void foo(int a[4096]);

    int a[4096];

    int main(int argc, char *argv[])
    {
        ...
        foo(a);
        ...
        foo(a);
        ...
        return 0;
    }

(diagram) a[4096] lives in main memory; SPE1 stages it in to its buffer (1), SPE2 bypasses it from SPE1 (2), and the stage-outs (3) and (4) become optional.

  • Do not transfer objects back to main memory unless strictly necessary
  • Exploit the information available in the bypassing mechanism
  • object versions
  • read count of a version
  • Token passing to avoid early stage-outs

SLIDE 21

Lazy write-back: example

(diagram) a[4096] lives in main memory; SPE1, SPE2 and SPE3 each hold a buffer. The version flows as a stage in (1) followed by bypasses (4, 6) and a single final stage out (8); the intermediate stage-outs (3), (5) and (7) are avoided.

1. Task 1 reads and writes a → Obj(a,1)
2. Task 2 reads a (Obj(a,1))
3. Task 3 reads a (Obj(a,1))
4. Task 4 reads and writes a → Obj(a,2)
5. Task 5 reads a (Obj(a,2))

  • The above is the perfect scenario
  • Variations possible depending on the relative ordering of execution of tasks and the schedule

SLIDE 22

Lazy write-back: results

Can we avoid a significant fraction of the transfers to main memory?

SLIDE 23

Renaming: traditional concept

    #pragma css task inout(a)
    void foo(int a[4096]);
    #pragma css task out(a)
    void moo(int a[4096]);

    int a[4096];

    int main(int argc, char *argv[])
    {
        ...
        ...
        ...
        return 0;
    }

(diagram) In main memory: A[4096], the original object (“user space”), and A_ren[4096], the renaming (“CellSs space”); foo(a) targets the original and moo(a) the renaming.

  • Renaming improves parallelism at the cost of extra memory.
  • Centralized

SLIDE 24

Renaming: traditional concept

(diagram) Explicit renaming in main memory (A[4096] and A_ren[4096]) and explicit renaming in the LS: foo on SPE1 uses buffer A, moo on SPE2 uses buffer B.

SLIDE 25

Renaming: JIT renaming

(diagram) Only the original object A[4096] remains in main memory; the renaming happens implicitly in the LS: foo on SPE1 produces its result in buffer A, which is bypassed to buffer B for moo on SPE2.

SLIDE 26

Renaming: JIT renaming

(diagram) The same scheme on a single SPE: the original object stays in main memory, and foo's result in buffer A is bypassed to buffer B for moo.

  • JIT renaming sometimes requires an SPE to bypass from itself.

SLIDE 27

Renaming: JIT renaming

(diagram) At stage-out time the SPE either writes back over the original user data in main memory or writes a renaming into the renaming pool.

  • Decision between stage-out or renaming made at the very last moment
  • No synchronisation with PPE unless the renaming pool is too small
  • Relation between scheduling and renaming

SLIDE 28

Ongoing work

  • verification of the bypassing protocol
  • studying ways to incorporate scheduling
  • distributed scheduling
  • shared representation of the Task Dependence Graph (TDG)

SLIDE 29

Questions?

SLIDE 30

Task Dependence Graph (TDG)

SLIDE 31

Speedup results

  • Very much work in progress
  • Linear algebra applications on 16x16 hypermatrices of 64x64 floats
  • Matrix multiplication, 2 variants of the Cholesky decomposition, a Jacobi computation and an LU decomposition