SLIDE 1

EFFICIENTLY SUPPORTING DYNAMIC TASK PARALLELISM ON HETEROGENEOUS CACHE-COHERENT SYSTEMS

Moyang Wang, Tuan Ta, Lin Cheng, Christopher Batten

Computer Systems Laboratory, Cornell University

Page 1 of 26

SLIDE 2

MANYCORE PROCESSORS

A spectrum from small core counts with hardware-based cache coherence to large core counts with software-centric or no cache coherence:

  • Cavium ThunderX (48 cores)
  • Tilera TILE64 (64 cores)
  • Intel Xeon Phi (72 cores)
  • NVIDIA GV100 GPU (72 SMs)
  • Celerity (511 cores)
  • KiloCore (1000 cores)
  • Adapteva Epiphany (1024 cores)

Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation

SLIDE 3

SOFTWARE CHALLENGE

Intel TBB:

    int fib( int n ) {
      if ( n < 2 ) return n;
      int x, y;
      tbb::parallel_invoke(
        [&] { x = fib( n - 1 ); },
        [&] { y = fib( n - 2 ); } );
      return (x + y);
    }

Intel Cilk Plus:

    int fib( int n ) {
      if ( n < 2 ) return n;
      int x = cilk_spawn fib( n - 1 );
      int y = fib( n - 2 );
      cilk_sync;
      return (x + y);
    }

  • Programmers expect to use familiar shared-memory programming models on manycore processors
  • It is even more difficult to allow cooperative execution between the host processor and a manycore co-processor
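Both snippets above express the same fork-join computation. As a point of reference, a rough equivalent using only standard C++ threads (an illustrative sketch, not how TBB or Cilk is implemented) looks like this — each "spawn" becomes a thread and each "sync" a join:

```cpp
#include <thread>

// Fork-join fib using plain std::thread. This shows the semantics only:
// a real work-stealing runtime would create lightweight tasks rather
// than one OS thread per spawn.
long fib(int n) {
    if (n < 2) return n;
    long x, y;
    std::thread child([&] { x = fib(n - 1); });  // "spawn"
    y = fib(n - 2);                              // continue in the parent
    child.join();                                // "sync"
    return x + y;
}
```

The join guarantees the child's write to `x` is visible before the parent reads it, which is exactly the parent-child consistency the runtimes above must provide.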



SLIDE 5

CONTRIBUTIONS

  • A work-stealing runtime for manycore processors with heterogeneous cache coherence (HCC)
  • A TBB/Cilk-like programming model
  • Efficient cooperative execution between big and tiny cores
  • Direct task stealing (DTS), a lightweight software and hardware technique to improve performance and energy efficiency
  • A detailed cycle-level evaluation

A big.TINY architecture combines a few big OOO cores with many tiny IO cores on a single die using heterogeneous cache coherence.

[Figure: big.TINY floorplan — a mesh of tiles containing tiny IO cores with L1s caches, big OOO cores with L1h caches, distributed L2 and directory (DIR) banks, routers (R), and memory controllers (MC).]

SLIDE 6

EFFICIENTLY SUPPORTING DYNAMIC TASK PARALLELISM ON HCC

  • Background
  • Implementing Work-Stealing Runtimes on HCC
  • Direct Task Stealing
  • Evaluation

SLIDE 7

HETEROGENEOUS CACHE COHERENCE (HCC)

  • We study three exemplary software-centric cache coherence protocols: DeNovo [1], GPU Write-Through (GPU-WT), and GPU Write-Back (GPU-WB)
  • They vary in their strategies for invalidating stale data and propagating dirty data
  • Prior work on Spandex [2] has studied how to efficiently integrate different protocols into HCC systems

              Stale Data     Dirty Data                Write
              Invalidation   Propagation               Granularity
    MESI      Writer         Owner, Write-Back         Cache Line
    DeNovo    Reader         Owner, Write-Back         Flexible
    GPU-WT    Reader         No-Owner, Write-Through   Word
    GPU-WB    Reader         No-Owner, Write-Back      Word

[1] H. Sung and S. V. Adve. DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations. ASPLOS 2015.
[2] J. Alsop, M. Sinclair, and S. V. Adve. Spandex: A Flexible Interface for Efficient Heterogeneous Coherence. ISCA 2018.

SLIDE 8

DYNAMIC TASK PARALLELISM

  • Tasks are generated dynamically at run-time
  • Diverse current and emerging parallel patterns:
  • Map (for-each)
  • Fork-join
  • Nesting
  • Supported by popular frameworks:
  • Intel Threading Building Blocks (TBB)
  • Intel Cilk Plus
  • OpenMP
  • Work-stealing runtimes provide automatic load-balancing

Pictures from McCool, Robison, and Reinders, Structured Parallel Programming: Patterns for Efficient Computation, 2012

SLIDE 9

DYNAMIC TASK PARALLELISM

    long fib( int n ) {
      if ( n < 2 ) return n;
      long x, y;
      parallel_invoke(
        [&] { x = fib( n - 1 ); },
        [&] { y = fib( n - 2 ); } );
      return (x + y);
    }

    void vvadd( int a[], int b[], int dst[], int n ) {
      parallel_for( 0, n, [&]( int i ) {
        dst[i] = a[i] + b[i];
      });
    }
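To make the `parallel_for` pattern concrete, here is a naive stand-in built from standard threads (a hypothetical helper with the same call shape as the slide's, not TBB's implementation): it statically splits the index range into one contiguous chunk per hardware thread, whereas a work-stealing runtime would create tasks so chunks can be rebalanced dynamically.

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Naive parallel_for sketch: static chunking, one std::thread per chunk.
void parallel_for(int lo, int hi, const std::function<void(int)>& body) {
    int n = hi - lo;
    if (n <= 0) return;
    unsigned p = std::thread::hardware_concurrency();
    if (p == 0) p = 1;                           // fallback if unknown
    int chunk = (n + (int)p - 1) / (int)p;       // ceil(n / p)
    std::vector<std::thread> workers;
    for (int start = lo; start < hi; start += chunk) {
        int end = std::min(hi, start + chunk);
        workers.emplace_back([start, end, &body] {
            for (int i = start; i < end; ++i) body(i);  // disjoint indices
        });
    }
    for (auto& w : workers) w.join();
}

// The slide's vvadd, unchanged, now runs against this helper.
void vvadd(const int a[], const int b[], int dst[], int n) {
    parallel_for(0, n, [&](int i) { dst[i] = a[i] + b[i]; });
}
```

Each worker writes disjoint elements of `dst`, so no synchronization beyond the final joins is needed.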


SLIDE 10

DYNAMIC TASK PARALLELISM

    class FibTask : public task {
      int n, *sum;
      void execute() {
        if ( n < 2 ) { *sum = n; return; }
        long x, y;
        FibTask a( n - 1, &x );
        FibTask b( n - 2, &y );
        this->reference_count = 2;
        task::spawn( &a );
        task::spawn( &b );
        task::wait( this );
        *sum = x + y;
      }
    }

SLIDE 11

WORK-STEALING RUNTIMES

    void task::wait( task* p ) {
      while ( p->ref_count > 0 ) {
        // Check local task queue
        task_queue[tid].lock_acquire();
        task* t = task_queue[tid].dequeue();
        task_queue[tid].lock_release();
        if (t) {
          // Execute dequeued task
          t->execute();
          amo_sub( t->parent->ref_count, 1 );
        } else {
          // Steal from another queue
          int vid = choose_victim();
          task_queue[vid].lock_acquire();
          t = task_queue[vid].steal();
          task_queue[vid].lock_release();
          if (t) {
            t->execute();
            amo_sub( t->parent->ref_count, 1 );
          }
        }
      }
    }


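The dequeue-then-steal policy in `task::wait` can be exercised with an ordinary deque standing in for each per-core task queue. This is a single-threaded sketch (locking and task execution elided; `take_task` is a name invented here): the owner pops newest-first from the back of its own queue, while a thief takes oldest-first from the victim's front, as in classic work-stealing deques — the slide's queue interface does not itself specify which ends are used.

```cpp
#include <deque>
#include <string>
#include <vector>

using TaskQueue = std::deque<std::string>;

// Mirror of the loop body: check the local queue first, then steal.
std::string take_task(std::vector<TaskQueue>& queues, int tid, int vid) {
    if (!queues[tid].empty()) {              // check local task queue
        std::string t = queues[tid].back();  // owner pops LIFO
        queues[tid].pop_back();
        return t;                            // would be t->execute()
    }
    if (!queues[vid].empty()) {              // steal from another queue
        std::string t = queues[vid].front(); // thief steals FIFO
        queues[vid].pop_front();
        return t;
    }
    return "";                               // no work found
}
```

Popping LIFO locally keeps the hot, recently spawned work in the owner's cache, while stealing FIFO hands thieves the oldest (typically largest) pending tasks.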


SLIDE 14

WORK-STEALING RUNTIMES

[Figure: four per-core task queues (Core 0 – Core 3), with the work in progress shown above each core.]

SLIDE 15

[Animation: Core 0 begins executing Task A.]

SLIDE 16

[Animation: Task A spawns Task B, which is enqueued on Core 0's task queue.]

SLIDE 17

[Animation: Core 0 dequeues Task B from its local task queue.]

SLIDE 18

[Animation: Task B spawns Task C.]

SLIDE 19

[Animation: Task B spawns Task D; Core 0's queue now holds Tasks C and D.]

SLIDE 20

[Animation: idle cores steal Task C and Task D from Core 0's queue.]

SLIDE 21

[Animation: Task E and Task F are spawned onto the thieves' queues.]

SLIDE 22

[Animation: Task E and Task F are in turn stolen by other idle cores.]

SLIDE 23

EFFICIENTLY SUPPORTING DYNAMIC TASK PARALLELISM ON HCC

  • Background
  • Implementing Work-Stealing Runtimes on HCC
  • Direct Task Stealing
  • Evaluation

SLIDE 24

WORK-STEALING RUNTIMES ON SOFTWARE-CENTRIC CACHE COHERENCE

  • Shared task queues must be coherent
  • DAG-Consistency [1]:
  • Child tasks read up-to-date data from the parent
  • Parents read up-to-date data from (finished) children

[1] R. D. Blumofe, M. Frigo, C. F. Joerg, C. E. Leiserson, and K. H. Randall. An Analysis of Dag-Consistent Distributed Shared-Memory Algorithms. SPAA 1996.
SLIDE 25

WORK-STEALING RUNTIMES ON SOFTWARE-CENTRIC CACHE COHERENCE

  • Supporting shared queues:
  • Lock-acquire -> cache invalidation
  • Lock-release -> cache flush

SLIDE 26

  • Stolen task on HCC:
  • Invalidate before execution
  • Flush after execution

SLIDE 27

  • Ensure parent-child synchronization

SLIDE 28

  • No-op when invalidation or flush is not required
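The placement of the coherence operations described across these slides can be pictured as hooks: invalidate on lock-acquire and before a stolen task runs, flush on lock-release and after a stolen task finishes. In this sketch the primitives just record themselves in a trace so the ordering is visible (on a tiny core they would be the real cache operations; the names beyond `cache_invalidate`/`cache_flush` are invented here):

```cpp
#include <string>
#include <vector>

std::vector<std::string> trace;  // records the order of coherence actions

void cache_invalidate() { trace.push_back("inv");   }  // drop possibly-stale lines
void cache_flush()      { trace.push_back("flush"); }  // write back dirty lines

// Lock-acquire -> invalidate, so we observe other cores' writes;
// lock-release -> flush, so our writes become visible to others.
void lock_acquire() { cache_invalidate(); }
void lock_release() { cache_flush(); }

// A stolen task invalidates before running (to read up-to-date data from
// the parent) and flushes after (so the parent sees its results).
void run_stolen_task(void (*task)()) {
    cache_invalidate();
    task();
    cache_flush();
}
```

When the protocol makes one of these operations unnecessary (e.g. write-through already propagated the dirty data), the corresponding hook becomes a no-op, as the last bullet notes.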


SLIDE 29

COOPERATIVE EXECUTION

  • The same runtime loop runs on both big and tiny cores
  • Invalidations and flushes are no-ops on big cores with MESI
  • Enables seamless work-stealing between big cores and tiny cores
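The "no-op on big cores" point can be sketched as the coherence hooks dispatching on core type (an illustrative interface, not the paper's actual runtime; `CoreType` and the counter are invented here). The runtime calls the same hook everywhere; on a MESI big core it does nothing because hardware already keeps that core's cache coherent:

```cpp
enum class CoreType { BigMESI, TinySoftware };

int flush_ops = 0;  // counts flushes that actually touch the cache

// Same call site on every core; real work only on software-coherent cores.
void cache_flush(CoreType core) {
    if (core == CoreType::TinySoftware) {
        ++flush_ops;  // would write back dirty lines here
    }
    // BigMESI: no-op -- MESI hardware propagates dirty data itself
}
```

Because the hook is uniform, a big core and a tiny core can run the identical `task::wait` loop and steal from each other without special cases.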

SLIDE 41

EFFICIENTLY SUPPORTING DYNAMIC TASK PARALLELISM ON HCC

  • Background
  • Implementing Work-Stealing Runtimes on HCC
  • Direct Task Stealing
  • Evaluation

SLIDE 42

THE OVERHEADS OF WORK-STEALING RUNTIMES ON HCC

  • Invalidation and/or flush on all accesses to task queues
  • Only need to maintain data consistency between parent and child
  • In work-stealing runtimes, steals are relatively rare, but every task can be stolen
  • Hard to know whether child tasks are stolen
  • Cost of AMOs

cache_invalidate(); cache_flush(); cache_invalidate(); cache_flush(); cache_invalidate(); cache_flush(); cache_invalidate();

Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation

slide-47
SLIDE 47

IMPROVING WORK-STEALING RUNTIMES ON HCC

Page 14 of 26

void task::wait( task* p ) {
  while ( p->ref_count > 0 ) {
    task_queue[tid].lock_acquire();
    task* t = task_queue[tid].dequeue();
    task_queue[tid].lock_release();
    if (t) {
      t->execute();
      amo_sub( t->parent->ref_count, 1 );
    } else {
      int vid = choose_victim();
      task_queue[vid].lock_acquire();
      t = task_queue[vid].steal();
      task_queue[vid].lock_release();
      if (t) {
        t->execute();
        amo_sub( t->parent->ref_count, 1 );
      }
    }
  }
}

  • What we want to achieve:
    • No inv/flush when accessing the local task queue
    • No invalidation if children are not stolen
    • No AMO if a child is not stolen
  • Our technique: direct task stealing (DTS) instead of indirect task stealing through shared task queues

Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation

slide-51
SLIDE 51


USER-LEVEL INTERRUPT (ULI)

Page 15 of 26

  • DTS is built on lightweight inter-processor user-level interrupts (ULIs)
  • ULIs are included in recent ISAs (e.g., RISC-V)
  • Similar to active messages [1] and ADM [2]

[1] T. von Eicken, D. Culler, S. Goldstein, and K. Schauser. Active Messages: A Mechanism for Integrated Communication and Computation. ISCA 1992.
[2] D. Sanchez, R. M. Yoo, and C. Kozyrakis. Flexible Architectural Support for Fine-Grain Scheduling. ASPLOS 2010.

ULI protocol: send interrupt → receiver jumps to a handler → handler sends an ACK/NACK

Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation


slide-61
SLIDE 61


IMPLEMENTING DIRECT-TASK STEALING WITH ULI

Page 16 of 26

[Figure: victim and thief cores, each with a private task queue and work in progress, above a shared LLC. The thief sends a ULI carrying the id of the victim's parent task; the victim jumps to its handler, flushes the child task's data to the shared LLC, marks the task as stolen (*), and replies with an ACK/NACK.]

Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation

slide-65
SLIDE 65

WORK-STEALING RUNTIMES WITH DTS

Page 17 of 26

void task::wait( task* p ) {
  while ( p->ref_count > 0 ) {
    uli_disable();                       // ULI off: local queue access is atomic
    task* t = task_queue[tid].dequeue(); // no lock, no inv/flush
    uli_enable();
    if (t) {
      t->execute();
      if ( t->parent->child_stolen )
        amo_sub( t->parent->ref_count, 1 );
      else
        t->parent->ref_count -= 1;       // plain store: no AMO needed
    } else {
      t = steal_using_dts();             // steal via ULI (inv/flush on this path)
      if (t) {
        t->execute();
        amo_sub( t->parent->ref_count, 1 );
      }
    }
  }
  if ( p->has_stolen_child )
    cache_invalidate();                  // invalidate only if a child was stolen
}

  • DTS achieves:
    • Access to task queues without locking
    • No AMO unless the parent has a stolen child
    • No invalidation unless a child is stolen

Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation

slide-66
SLIDE 66

EFFICIENTLY SUPPORTING DYNAMIC TASK PARALLELISM ON HCC

Page 18 of 26

  • Background
  • Implementing Work-Stealing Runtimes on HCC
  • Direct Task Stealing
  • Evaluation

Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation


slide-67
SLIDE 67

EVALUATION METHODOLOGY

Page 19 of 26

  • gem5 (Ruby and Garnet2.0) cycle-level simulator
  • 4 big cores: OOO, 64KB L1D cache each
  • 60 tiny cores: in-order, 4KB L1D cache each
  • Total cache capacity: 16 tiny cores = 1 big core
  • Baselines:
    • O3x8: eight big cores
    • big.TINY/MESI
  • big.TINY with HCC:
    • big.TINY/HCC
    • big.TINY/HCC-DTS

Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation


slide-68
SLIDE 68

EVALUATION METHODOLOGY

Page 20 of 26

  • 13 dynamic task-parallel application kernels from the Cilk-5 and Ligra benchmark suites
  • Task granularity is optimized for the big.TINY/MESI baseline
  • We use moderate input sizes and moderate parallelism on a 64-core system to be representative of larger systems running larger inputs (weak scaling)
  • See the paper for a 256-core case study validating this weak-scaling claim

Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation

slide-69
SLIDE 69

PERFORMANCE: BIG.TINY/MESI VS. O3X8

Page 21 of 26

                          Speedup over Serial IO
Name         Input        O3x1   O3x4   O3x8   b.T/MESI
cilk5-cs     3000000      1.65   4.92   9.78   18.70
cilk5-lu     128          2.48   9.46   17.24  23.93
cilk5-mm     256          11.38  11.76  22.04  41.23
cilk5-mt     8000         5.71   19.70  39.94  57.43
cilk5-nq     10           1.57   3.87   7.03   2.93
ligra-bc     rMat_100K    2.05   6.29   13.06  11.48
ligra-bf     rMat_200K    1.80   5.36   11.25  12.80
ligra-bfs    rMat_800K    2.23   6.23   12.70  15.63
ligra-bfsbv  rMat_500K    1.91   6.17   12.25  14.42
ligra-cc     rMat_500K    3.00   9.11   20.66  24.12
ligra-mis    rMat_100K    2.43   7.70   15.61  19.01
ligra-radii  rMat_200K    2.80   8.17   17.89  25.94
ligra-tc     rMat_200K    1.49   4.99   10.89  23.21
geomean                   2.56   7.26   14.70  16.94

  • Work-stealing runtimes enable cooperative execution between big and tiny cores
  • Total cache capacity: 4 big cores + 60 tiny cores = 7.8 big cores
  • big.TINY achieves better performance by exploiting parallelism and cooperative execution

Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation


slide-71
SLIDE 71

PERFORMANCE: BIG.TINY/HCC VS. BIG.TINY/MESI

Page 22 of 26

Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation

Big cores always use MESI; tiny cores use:

  • dnv = DeNovo
  • gwt = GPU-WT
  • gwb = GPU-WB
  • HCC configurations have slightly worse performance than big.TINY/MESI
  • DTS improves the performance of work-stealing runtimes on HCC
slide-72
SLIDE 72

EXECUTION TIME BREAKDOWN: BIG.TINY/HCC VS. BIG.TINY/MESI

Page 23 of 26

Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation

Big cores always use MESI, tiny cores use:

  • dnv = DeNovo
  • gwt = GPU-WT
  • gwb = GPU-WB
  • The overhead of HCC comes from data load, data store, and AMO
  • DTS mitigates these overheads
slide-77
SLIDE 77

EFFECTS OF DTS

Page 24 of 26

  • DTS reduces the number of cache invalidations
  • DTS reduces the number of cache flushes
  • DTS improves the L1 hit rate
  • DTS improves overall performance

Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation

             Invalidation Decrease (%)  Flush Decr. (%)  Hit Rate Increase (%)
App          dnv    gwt    gwb          gwb              dnv    gwt    gwb
cilk5-cs     99.42  99.28  99.50        98.86            1.80   2.45   1.30
cilk5-lu     98.83  99.78  99.53        98.40            1.12   7.12   2.94
cilk5-mm     99.22  99.67  99.62        99.12            30.03  42.19  36.80
cilk5-mt     99.88  99.73  99.93        99.82            12.45  2.70   6.56
cilk5-nq     97.74  97.88  98.32        95.84            16.84  28.87  27.04
ligra-bc     94.89  97.04  97.33        93.80            7.64   21.43  14.99
ligra-bf     29.02  38.14  40.24        21.63            7.22   17.14  11.17
ligra-bfs    94.18  95.85  95.90        91.23            3.48   15.76  8.00
ligra-bfsbv  39.31  47.36  50.74        29.46            3.10   12.65  7.56
ligra-cc     98.03  98.17  98.16        95.89            3.11   11.11  6.17
ligra-mis    97.35  98.28  98.36        96.16            5.62   16.29  11.10
ligra-radii  95.97  98.17  98.19        95.75            3.62   11.00  7.03
ligra-tc     10.83  15.99  17.02        7.52             1.59   3.55   3.02

slide-78
SLIDE 78

NOC TRAFFIC: BIG.TINY/HCC VS. BIG.TINY/MESI

Page 25 of 26

Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation

Big cores always use MESI; tiny cores use:

  • dnv = DeNovo
  • gwt = GPU-WT
  • gwb = GPU-WB
  • HCC configurations increase network traffic due to invalidations and flushes
  • DTS reduces network traffic, and therefore energy
  • HCC+DTS achieves energy similar to big.TINY/MESI
slide-80
SLIDE 80

TAKE-AWAY POINTS

Page 26 of 26

  • We present a work-stealing runtime for HCC systems:
    • Provides a Cilk/TBB-like programming model
    • Enables cooperative execution between big and tiny cores
  • DTS improves performance and energy efficiency
  • With DTS, HCC systems achieve better performance and similar energy efficiency compared to full-system hardware-based cache coherence


This work was supported in part by the Center for Applications Driving Architectures (ADA), one of six centers of JUMP, a Semiconductor Research Corporation program cosponsored by DARPA, and equipment donations from Intel.

Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation