Data-Centric Execution of Speculative Parallel Programs

SLIDE 1

Data-Centric Execution of Speculative Parallel Programs

MARK JEFFREY, SUVINAY SUBRAMANIAN, MALEEN ABEYDEERA, JOEL EMER, DANIEL SANCHEZ

MICRO 2016

SLIDE 2

Executive summary

Many-cores must exploit cache locality to scale

Current speculative systems, e.g. TLS or TM, do not exploit locality

Spatial Hints: run tasks likely to access the same data in the same place

  • A software-given hint denotes the data a new task is likely to access
  • Hardware maps tasks with the same hint to the same place
  • Hardware uses hints to perform locality-aware load balancing

Our techniques make speculative parallelism practical at large scale

  • It is easy to modify programs to convey locality through hints
  • Performance improves by 3.3x at 256 cores
  • We reduce wasted work by 6.4x and network traffic by 3.5x

SLIDE 3

Prior speculative systems scale poorly

TRANSACTIONAL MEMORY (TM) SCHEDULERS

  • Reduce wasted work of coarse-grain txns
  • Limit concurrency: When to run a task?

SPATIAL HINTS

  • Make accesses local for fine-grain tasks
  • Less data movement: Where to run a task?

SLIDE 4

Prior speculative systems scale poorly

Spatially map tasks for improved locality and less waste

SLIDE 5

Prior non-speculative locality techniques do not work for speculation

STATIC TASK MAPPING

Data dependences known a priori

  • Linear algebra, Anton 2 [ASPLOS ‘13]

Graph partitioning

  • Localizes communication and scheduling
  • Slow preprocessing step
  • Cannot adapt to imbalance

DYNAMIC TASK MAPPING

Work stealing

  • Cheap, local enqueues
  • Steals to adapt to imbalance
  • Limited application types
  • Stealing interferes with speculation

SLIDE 6

Baseline Architecture: Swarm [MICRO ‘15]

SLIDE 7

Baseline Swarm execution model

Programs consist of timestamped tasks

  • Tasks can create child tasks with a timestamp >= their own
  • Tasks appear to execute in timestamp order

swarm::enqueue(function_pointer, timestamp, arguments...);

General execution model supports ordered and unordered parallelism
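The execution model above can be read as a sequential reference semantics: whatever the hardware does speculatively, the result must match running tasks one at a time in timestamp order. The sketch below models only that reference behavior in software. It is not the Swarm API or hardware; `TaskQueue`, `enqueue`, and `runAll` are hypothetical names, and breaking timestamp ties by creation order is a choice made for this model.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical software model of timestamp-ordered tasks (not the Swarm API).
class TaskQueue {
public:
    using Fn = std::function<void(TaskQueue&, uint64_t)>;

    // Enqueue a task. While a task runs, its children must carry a
    // timestamp >= the running task's timestamp.
    void enqueue(Fn fn, uint64_t ts) {
        if (ts < now_) throw std::invalid_argument("child timestamp < parent timestamp");
        pending_.push(Task{ts, seq_++, std::move(fn)});
    }

    // Run all tasks in timestamp order (ties: creation order, in this model).
    void runAll() {
        while (!pending_.empty()) {
            Task t = pending_.top();
            pending_.pop();
            now_ = t.ts;
            t.fn(*this, t.ts);
        }
        now_ = 0;  // allow new root tasks after the queue drains
    }

private:
    struct Task {
        uint64_t ts;
        uint64_t seq;
        Fn fn;
    };
    struct Later {
        bool operator()(const Task& a, const Task& b) const {
            return a.ts != b.ts ? a.ts > b.ts : a.seq > b.seq;
        }
    };
    std::priority_queue<Task, std::vector<Task>, Later> pending_;
    uint64_t now_ = 0;
    uint64_t seq_ = 0;
};
```

A speculative implementation is free to run these tasks out of order, as long as the committed result is the one this loop would produce.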

SLIDE 8

Baseline Swarm architecture

  • Speculatively executes tasks out of order
  • Large hardware task queues
  • Scalable ordered speculation
  • Scalable ordered commits

[Figure: 64-tile, 256-core chip with four Mem / IO blocks. Tile organization: four cores, each with L1 I/D caches, plus an L2, an L3 slice, a router, and a task unit.]

Efficiently supports tiny speculative tasks

SLIDE 9

Spatial Hints in Action

COMBINING SPECULATION AND LOCALITY

SLIDES 10-31

Example: Discrete event simulation (DES)

[Figure, built up one step per slide: a circuit of gates labeled A to E with inputs r and s and output t, where t = r XOR s. Each signal toggle is a task, and toggles propagate from gate to gate.]

Order = Simulated time (ns), from 1 to 6

Tasks shown on the timeline: r=1, A=1, C0=0, D0=1, E1=1, t=1, s=1, C1=1, B=1, D1=0, E1=0, t=0
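As a rough software analogue of the example, the sketch below simulates a single two-input XOR gate with an event queue ordered by simulated time. The wiring of the slide's circuit is not recoverable from this transcript, so the one-gate circuit, the `Event` and `Toggle` types, and the fixed gate delay are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// One input change: at `time`, input `pin` (0 or 1) of the gate becomes `value`.
struct Event {
    uint64_t time;
    int pin;
    bool value;
};

struct LaterEvent {
    bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
};

// One output change produced by the simulation.
struct Toggle {
    uint64_t time;
    bool value;
};

// Simulate one XOR gate (t = r XOR s) with a fixed propagation delay.
// Events are processed in simulated-time order; only an input change that
// flips the output produces new work (an output toggle).
std::vector<Toggle> simulateXor(const std::vector<Event>& inputs, uint64_t delay) {
    std::priority_queue<Event, std::vector<Event>, LaterEvent> events(LaterEvent{}, inputs);
    bool pins[2] = {false, false};
    bool out = false;
    std::vector<Toggle> toggles;
    while (!events.empty()) {
        Event e = events.top();
        events.pop();
        pins[e.pin] = e.value;
        bool next = pins[0] != pins[1];  // XOR of the two inputs
        if (next != out) {
            out = next;
            toggles.push_back({e.time + delay, out});
        }
    }
    return toggles;
}
```

In a full simulator each toggle would be enqueued as a new task for the downstream gates, which is where the per-gate locality discussed later comes from.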

SLIDE 32

Extracting parallelism in DES

Execute independent tasks out of order

[Figure: the DES tasks (r, A, C, D, E, t, s, C, B, D, E, t) laid out by simulated time (1 to 6 ns) with their data dependences, next to a valid schedule that runs independent tasks in parallel.]

2.4x parallelism (more in larger circuits)

Parallelism is plentiful despite data dependences

SLIDE 33

Speculation scales poorly without locality

[Figure: des speedup versus core count (1c to 256c) for Random and Stealing.]

Swarm sends new tasks to random tiles

  • Good for load balance
  • Poor locality hurts scalability beyond 100 cores

Work stealing: a non-speculative scheduler

  • Enqueue new tasks locally
  • Steal from the most-loaded tile
  • Not a good strategy for DES

SLIDE 34

Where is the locality?

  • Each task operates on a single gate
  • The gate is known when the task is created

With fine-grain tasks, most data accessed is known at creation time

[Figure: DES schedule of the tasks r, A, C, D, E, t, s, C, B, D, E, t.]

SLIDE 35

Data-centric speculation scales well

Hints: map each gate to a statically-chosen tile

Send new tasks for a gate to its corresponding tile

  • 1. Less data movement
  • 2. Conflicts are local, cheap, and less frequent

[Figure: des speedup versus core count (1c to 256c) for Stealing, Random, and Hints; the Hints curve is labeled 186x. A second diagram shows tasks for gates A to E sent to tiles.]

But we can do better!

SLIDE 36

Load-balanced speculation scales best

Static gate-to-tile mapping may cause hotspots

  • E.g. some gates toggle more frequently

Dynamically remap gates (Hints) across tiles

[Figure: des speedup versus core count (1c to 256c) for Stealing, Random, Hints, and Load-Balanced Hints; the Load-Balanced Hints curve is labeled 236x.]

Programmer knows most of the data accessed

Spatial Hints convey program-level knowledge to exploit locality

SLIDE 37

Spatial Hints Implementation

SLIDE 38

Hint mechanisms are straightforward

SOFTWARE

A Spatial Hint is an integer value

  • Given at task creation time
  • Denotes data likely to be accessed by the task
  • E.g. the gate ID in DES

HARDWARE

  • Hashes each new task’s Hint to a tile ID
  • Serializes same-Hint tasks

Localize most data accesses within a tile

Serialize tasks likely to conflict
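The two hardware mechanisms can be sketched in software as a hash from hint to tile ID and a dispatch check that keeps same-hint tasks from running concurrently. The slides do not say which hash function the hardware uses; the mixer below is the widely used splitmix64 finalizer, picked only for illustration, and `tileForHint` and `canDispatch` are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr uint32_t kTiles = 64;  // tile count of the 64-tile chip in the slides

// Illustrative 64-bit mixer (splitmix64 finalizer); an assumption, not the
// hardware's actual hash.
inline uint64_t mix(uint64_t x) {
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Same hint -> same tile, so tasks likely to touch the same data land together.
inline uint32_t tileForHint(uint64_t hint) {
    return static_cast<uint32_t>(mix(hint) % kTiles);
}

// Serialization as a predicate: a task may start only if no task currently
// running on its tile carries the same hint.
inline bool canDispatch(uint64_t hint, const std::vector<uint64_t>& runningHints) {
    return std::find(runningHints.begin(), runningHints.end(), hint) == runningHints.end();
}
```

Because the mapping is a pure function of the hint, every enqueue with the same gate ID reaches the same tile without any coordination between tiles.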

SLIDE 39

Load balance with a level of indirection

Static hint-to-tile mapping may cause imbalance

[Figure: two mappings for Hint 0xF00. Static: a hash H maps the Hint directly to a tile ID. With indirection: H maps the Hint to a bucket, and a reconfigurable tile map translates the bucket to a tile ID.]

Instead, periodically remap hints across tiles to equalize load
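The level of indirection can be pictured as a small table from buckets to tiles: hints hash to a bucket, and remapping a bucket moves every hint in it at once. The sketch below is a software illustration only. The bucket count, the hash, and the `TileMap` class are assumptions, since the transcript does not give these details.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBuckets = 1024;  // hypothetical bucket count
constexpr uint32_t kTiles = 64;         // 64-tile chip from the slides

class TileMap {
public:
    TileMap() {
        // Start with buckets spread round-robin across tiles.
        for (std::size_t b = 0; b < kBuckets; ++b)
            bucketToTile_[b] = static_cast<uint32_t>(b % kTiles);
    }

    // Hint -> bucket is fixed; only the bucket -> tile table changes.
    static std::size_t bucketOf(uint64_t hint) {
        return static_cast<std::size_t>(hash(hint) % kBuckets);
    }

    uint32_t tileFor(uint64_t hint) const { return bucketToTile_[bucketOf(hint)]; }

    // Reassign a whole bucket (and every hint that hashes to it) to a tile.
    void remap(std::size_t bucket, uint32_t tile) { bucketToTile_[bucket] = tile; }

private:
    // Illustrative mixer (splitmix64 finalizer), not the hardware's hash.
    static uint64_t hash(uint64_t x) {
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    std::array<uint32_t, kBuckets> bucketToTile_{};
};
```

Using more buckets than tiles lets load be shifted in small increments while same-hint tasks still always share a tile.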

SLIDE 40

“Load” is different for speculation

Non-speculative systems use # queued tasks as a proxy for load

When imbalanced, speculative systems often

  • Don’t run out of work
  • Abort more work or strain speculation resources


Remap hints to tiles to balance # of committed cycles per tile
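To make the load metric concrete, the sketch below performs one greedy rebalancing step using committed cycles per bucket as the load signal. It is an illustrative policy, not the reconfiguration algorithm from the paper, which this transcript does not describe; `tileLoads` and `rebalanceStep` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-tile load = committed cycles summed over the buckets mapped to that tile.
std::vector<uint64_t> tileLoads(const std::vector<uint32_t>& bucketToTile,
                                const std::vector<uint64_t>& committedCycles,
                                uint32_t numTiles) {
    std::vector<uint64_t> load(numTiles, 0);
    for (std::size_t b = 0; b < bucketToTile.size(); ++b)
        load[bucketToTile[b]] += committedCycles[b];
    return load;
}

// One greedy step: move a single bucket from the most-loaded tile to the
// least-loaded tile, picking the bucket that leaves the two tiles closest in
// load. Returns true if a bucket was moved.
bool rebalanceStep(std::vector<uint32_t>& bucketToTile,
                   const std::vector<uint64_t>& committedCycles,
                   uint32_t numTiles) {
    std::vector<uint64_t> load = tileLoads(bucketToTile, committedCycles, numTiles);
    if (load.empty()) return false;
    auto maxIt = std::max_element(load.begin(), load.end());
    auto minIt = std::min_element(load.begin(), load.end());
    uint32_t hot = static_cast<uint32_t>(maxIt - load.begin());
    uint32_t cold = static_cast<uint32_t>(minIt - load.begin());
    uint64_t gap = *maxIt - *minIt;

    std::size_t best = bucketToTile.size();
    uint64_t bestDiff = gap;  // moving nothing leaves the gap unchanged
    for (std::size_t b = 0; b < bucketToTile.size(); ++b) {
        uint64_t c = committedCycles[b];
        // Moving c cycles changes the gap to |gap - 2c|; only 0 < c < gap helps.
        if (bucketToTile[b] != hot || c == 0 || c >= gap) continue;
        uint64_t diff = (2 * c > gap) ? 2 * c - gap : gap - 2 * c;
        if (diff < bestDiff) {
            bestDiff = diff;
            best = b;
        }
    }
    if (best == bucketToTile.size()) return false;
    bucketToTile[best] = cold;
    return true;
}
```

Counting committed cycles rather than queued tasks means aborted work does not register as useful load, which matches the point that imbalanced speculative systems tend to abort more rather than run dry.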

SLIDE 41

Adding hints to applications is easy

void desTask(Timestamp ts, GateInput* input) {
  Gate* g = input->gate();
  // g is a pointer, so member access uses ->
  bool toggledOutput = g->simulateToggle(input);
  if (toggledOutput) {
    // Toggle all inputs connected to this gate
    for (GateInput* i : g->connectedInputs())
      swarm::enqueue(desTask,
                     /*Timestamp*/ ts + delay(g, i),
                     /*Hint*/ i->gate()->id,
                     i);
  }
}


One line of code to express the Gate ID as a Hint

SLIDE 42

Adding hints to applications is easy

Benchmark                 Hint                       Why?
des                       Gate ID                    Map tasks for same gate to same tile
nocsim                    Router ID                  Frequent intra-router communication
bfs, sssp, astar, color   Cache-line address         Several vertices reside on the same line
silo                      (Table ID, primary key)    Each task accesses one database tuple
genome, kmeans            Multiple
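Two of the hint choices in the table can be sketched as small helper functions that turn program data into an integer hint. Both helpers are hypothetical: the 64-byte line size and the 16/48-bit packing of (Table ID, primary key) are assumptions for illustration, not details given in the slides.

```cpp
#include <cstdint>

// Assumed cache-line size; the slides do not state it.
constexpr uint64_t kLineBytes = 64;

// Cache-line-address hint: all objects on the same line share one hint, so
// tasks for neighboring vertices map to the same tile.
inline uint64_t lineHint(const void* p) {
    return static_cast<uint64_t>(reinterpret_cast<uintptr_t>(p)) / kLineBytes;
}

// (Table ID, primary key) packed into a single integer hint. The 16-bit /
// 48-bit split is an arbitrary choice for this sketch.
inline uint64_t tupleHint(uint16_t tableId, uint64_t primaryKey) {
    return (static_cast<uint64_t>(tableId) << 48) | (primaryKey & ((1ULL << 48) - 1));
}
```

Either value would then be passed as the hint argument at task creation, in the same position as the gate ID in the des example.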

SLIDE 43

See the paper for more details!

  • Load balance reconfiguration algorithm
  • Choice of application hints
  • Relationship between task size and hint effectiveness

SLIDE 44

Evaluation

SLIDE 45

Methodology

Event-driven, Pin-based simulator

Target system: 256-core, 64-tile chip

  • 64 MB shared L3 (1 MB/tile)
  • 256 KB per-tile L2s
  • 16 KB per-core L1s
  • 16K task queue entries (64/core)
  • 4K commit queue entries (16/core)
  • In-order, single-issue, scoreboarded cores

Scalability experiments from 1–256 cores

  • Scaled-down systems have fewer tiles
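The per-tile and per-core figures are consistent with the chip-wide totals, which the compile-time checks below confirm. The constant names are hypothetical; the four-cores-per-tile value follows from 256 cores on 64 tiles.

```cpp
#include <cstdint>

constexpr uint64_t kTiles = 64;
constexpr uint64_t kCoresPerTile = 4;  // 256 cores / 64 tiles
constexpr uint64_t kCores = kTiles * kCoresPerTile;
constexpr uint64_t kL3Bytes = kTiles * (1ULL << 20);    // 1 MB L3 slice per tile
constexpr uint64_t kTaskQueueEntries = kCores * 64;     // 64 entries per core
constexpr uint64_t kCommitQueueEntries = kCores * 16;   // 16 entries per core

static_assert(kCores == 256, "256-core chip");
static_assert(kL3Bytes == (64ULL << 20), "64 MB shared L3");
static_assert(kTaskQueueEntries == 16 * 1024, "16K task queue entries");
static_assert(kCommitQueueEntries == 4 * 1024, "4K commit queue entries");
```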

SLIDE 46

Load-Balanced Hints 3.3x faster than Random (193x gmean vs 58x)

[Figure: speedup versus core count (1c to 256c), one panel per benchmark: bfs, sssp, astar, color, des, nocsim, silo, genome, kmeans. Schedulers compared: Random, Stealing, Hints, and Load-Balanced Hints (LBHints).]

Hints make speculation practical on large-scale systems

  • Stealing is inconsistent across benchmarks
  • Load-Balanced Hints 17% to 27% faster than Hints
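The headline numbers are geometric means across benchmarks, and 193x over 58x gives the quoted 3.3x ratio. A minimal sketch of a geometric-mean helper follows. The per-benchmark speedups are not listed in this transcript, so no benchmark data is reproduced here, and `gmean` is a hypothetical name.

```cpp
#include <cmath>
#include <vector>

// Geometric mean of strictly positive values, computed in log space to avoid
// overflow when multiplying many large speedups.
double gmean(const std::vector<double>& xs) {
    if (xs.empty()) return 0.0;
    double logSum = 0.0;
    for (double x : xs) logSum += std::log(x);
    return std::exp(logSum / static_cast<double>(xs.size()));
}
```

The geometric mean is the usual choice for averaging speedups because the ratio of two gmeans equals the gmean of the per-benchmark ratios.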

SLIDE 47

Hints make speculation more efficient

[Figure: two bar charts on a 0.0 to 1.0 scale, with a pair of bars labeled R and L for each benchmark (bfs, sssp, astar, color, des, nocsim, silo, genome, kmeans): Aborted Cycles and NoC data transferred.]

  • Reduce wasted work by 6.4x
  • Reduce network traffic by 3.5x

SLIDE 48

Conclusion

Speculative architectures must exploit locality to scale to 100s of cores

  • Important to simplify parallel programming

Spatial Hints convey app-level knowledge to exploit cache locality

Hardware leverages hints by:

  • Sending tasks likely to access the same data to the same tile
  • Serializing tasks likely to conflict
  • Balancing work in a locality-aware and speculation-friendly way

Our techniques make speculation practical on large-scale systems

SLIDE 49

Thank you! Questions?
