EFFICIENTLY SUPPORTING DYNAMIC TASK PARALLELISM ON HETEROGENEOUS CACHE-COHERENT SYSTEMS
Moyang Wang, Tuan Ta, Lin Cheng, Christopher Batten
Computer Systems Laboratory, Cornell University

Manycore Processors
Hardware-Based Cache Coherence:
  Tilera TILE64 (64 cores)
  Intel Xeon Phi (72 cores)
  Cavium ThunderX (48 cores)

Software-Centric Cache Coherence / No Coherence:
  NVIDIA GV100 GPU (72 SMs)
  Celerity (511 cores)
  Adapteva Epiphany (1024 cores)
  KiloCore (1000 cores)
Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation
Intel TBB:

    int fib( int n ) {
      if ( n < 2 ) return n;
      int x, y;
      tbb::parallel_invoke(
        [&] { x = fib( n - 1 ); },
        [&] { y = fib( n - 2 ); }
      );
      return (x + y);
    }

Intel Cilk Plus:

    int fib( int n ) {
      if ( n < 2 ) return n;
      int x = cilk_spawn fib( n - 1 );
      int y = fib( n - 2 );
      cilk_sync;
      return (x + y);
    }
A big.TINY architecture combines a few big out-of-order (OOO) cores with many tiny in-order (IO) cores on a single die using heterogeneous cache coherence (HCC).
[Figure: big.TINY chip diagram showing big and tiny cores with private L1 caches (labeled L1h and L1s), routers (R), shared L2 banks with directories (DIR), and memory controllers (MC).]
Protocol   Stale Data Invalidation   Dirty Data Propagation     Write Granularity
MESI       Writer                    Owner, Write-Back          Cache Line
DeNovo     Reader                    Owner, Write-Back          Flexible
GPU-WT     Reader                    No-Owner, Write-Through    Word
GPU-WB     Reader                    No-Owner, Write-Back       Word

[1] H. Sung and S. V. Adve. DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations. ASPLOS 2015.
[2] J. Alsop, M. Sinclair, and S. V. Adve. Spandex: A Flexible Interface for Efficient Heterogeneous Coherence. ISCA 2018.
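The protocol table can be encoded as data, which makes it easier to reason about what a runtime has to do on each kind of cache. The sketch below is illustrative only. The type names, `kProtocols`, `needs_invalidate`, and `needs_flush` are hypothetical, and the two rules are assumptions read off the table rather than statements from the slides: reader-initiated invalidation is taken to mean that software must invalidate explicitly, and write-back without an owner is taken to mean that software must flush explicitly.

```cpp
#include <cassert>

// Column values from the protocol table.
enum class Invalidation { Writer, Reader };
enum class Propagation  { OwnerWriteBack, NoOwnerWriteThrough, NoOwnerWriteBack };

struct Protocol {
  const char*  name;
  Invalidation stale_data_invalidation;
  Propagation  dirty_data_propagation;
  const char*  write_granularity;
};

// One entry per table row, in the same order as the table.
constexpr Protocol kProtocols[] = {
  { "MESI",   Invalidation::Writer, Propagation::OwnerWriteBack,      "Cache Line" },
  { "DeNovo", Invalidation::Reader, Propagation::OwnerWriteBack,      "Flexible"   },
  { "GPU-WT", Invalidation::Reader, Propagation::NoOwnerWriteThrough, "Word"       },
  { "GPU-WB", Invalidation::Reader, Propagation::NoOwnerWriteBack,    "Word"       },
};

// Assumption: with reader-initiated invalidation, nobody removes stale
// copies on the reader's behalf, so software must invalidate explicitly.
constexpr bool needs_invalidate( const Protocol& p ) {
  return p.stale_data_invalidation == Invalidation::Reader;
}

// Assumption: dirty data must be flushed explicitly only when it is
// written back lazily and no owner is tracked for it.
constexpr bool needs_flush( const Protocol& p ) {
  return p.dirty_data_propagation == Propagation::NoOwnerWriteBack;
}
```

Under these assumptions, only the GPU-WB row requires both operations, and the MESI row requires neither.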
Pictures from McCool, Robison, and Reinders, Structured Parallel Programming: Patterns for Efficient Computation, 2012.
    long fib( int n ) {
      if ( n < 2 ) return n;
      long x, y;
      parallel_invoke(
        [&] { x = fib( n - 1 ); },
        [&] { y = fib( n - 2 ); }
      );
      return (x + y);
    }

    void vvadd( int a[], int b[], int dst[], int n ) {
      parallel_for( 0, n, [&]( int i ) {
        dst[i] = a[i] + b[i];
      });
    }
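To make the two patterns concrete, here is a minimal sketch of how `parallel_for` could be layered on top of `parallel_invoke` by recursive halving. It is not the runtime from the talk: this `parallel_invoke` is a hypothetical stand-in that uses one `std::thread` per fork instead of work-stealing, and the `grain` parameter is an assumption added to bound the number of threads.

```cpp
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical stand-in for parallel_invoke: run f on a new thread and
// g on the calling thread, then join (a plain fork-join).
template <typename F, typename G>
void parallel_invoke( F&& f, G&& g ) {
  std::thread child( std::forward<F>( f ) );
  g();
  child.join();
}

// Hypothetical parallel_for: halve [lo, hi) recursively until a range
// holds at most `grain` iterations, then run that range serially.
template <typename Body>
void parallel_for( int lo, int hi, const Body& body, int grain = 1024 ) {
  assert( lo <= hi && grain > 0 );
  if ( hi - lo <= grain ) {
    for ( int i = lo; i < hi; ++i ) body( i );
    return;
  }
  int mid = lo + ( hi - lo ) / 2;
  parallel_invoke( [&] { parallel_for( lo, mid, body, grain ); },
                   [&] { parallel_for( mid, hi, body, grain ); } );
}

// Element-wise vector addition, as in the vvadd example.
void vvadd( const int a[], const int b[], int dst[], int n ) {
  parallel_for( 0, n, [&]( int i ) { dst[i] = a[i] + b[i]; } );
}
```

Calling `vvadd( a, b, dst, n )` writes `a[i] + b[i]` into `dst[i]` for every `i` in `[0, n)`. A work-stealing runtime would replace the per-fork thread with a task pushed onto a queue.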
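    class FibTask : public task {
    public:
      FibTask( int n, long* sum ) : n( n ), sum( sum ) {}

      void execute() {
        if ( n < 2 ) {
          *sum = n;
          return;
        }
        long x, y;
        FibTask a( n - 1, &x );
        FibTask b( n - 2, &y );
        this->ref_count = 2;   // two children must finish before wait() returns
        task::spawn( &a );
        task::spawn( &b );
        task::wait( this );
        *sum = x + y;
      }

    private:
      int   n;
      long* sum;               // long, to match the children's results x and y
    };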
    void task::wait( task* p ) {
      // Keep scheduling other tasks until all children of p have finished.
      while ( p->ref_count > 0 ) {
        // First try this thread's own task queue.
        task_queue[tid].lock_acquire();
        task* t = task_queue[tid].dequeue();
        task_queue[tid].lock_release();
        if (t) {
          t->execute();
          amo_sub( t->parent->ref_count, 1 );
        }
        else {
          // Own queue is empty: steal from a victim while holding the victim's lock.
          int vid = choose_victim();
          task_queue[vid].lock_acquire();
          t = task_queue[vid].steal();
          task_queue[vid].lock_release();
          if (t) {
            t->execute();
            amo_sub( t->parent->ref_count, 1 );
          }
        }
      }
    }
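The `task::wait` loop can be exercised on an ordinary cache-coherent machine. The following is a minimal sketch under stated assumptions, not the runtime from the talk: `TaskQueue`, `run`, `parallel_fib`, `kThreads`, and the round-robin `choose_victim` are hypothetical, a `std::mutex` stands in for `lock_acquire`/`lock_release`, and `std::atomic::fetch_sub` stands in for `amo_sub`.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kThreads = 4;          // hypothetical worker count
thread_local int tid = 0;            // index of the calling worker

struct task {
  task*            parent = nullptr;
  std::atomic<int> ref_count{ 0 };   // children that have not finished yet
  virtual void execute() = 0;
  virtual ~task() = default;
  static void spawn( task* t );
  static void wait( task* p );
};

// One lock-protected deque per worker: the owner pops from the back,
// thieves take from the front.
struct TaskQueue {
  std::mutex        lock;
  std::deque<task*> q;
  void enqueue( task* t ) { std::lock_guard<std::mutex> g( lock ); q.push_back( t ); }
  task* dequeue() {
    std::lock_guard<std::mutex> g( lock );
    if ( q.empty() ) return nullptr;
    task* t = q.back(); q.pop_back(); return t;
  }
  task* steal() {
    std::lock_guard<std::mutex> g( lock );
    if ( q.empty() ) return nullptr;
    task* t = q.front(); q.pop_front(); return t;
  }
};

TaskQueue task_queue[kThreads];

int choose_victim() {                // simple round-robin victim selection
  thread_local unsigned next = 0;
  return static_cast<int>( ( tid + 1 + next++ ) % kThreads );
}

void run( task* t ) {
  task* p = t->parent;               // t may be destroyed once p sees the decrement
  t->execute();
  if ( p ) p->ref_count.fetch_sub( 1 );
}

void task::spawn( task* t ) { task_queue[tid].enqueue( t ); }

void task::wait( task* p ) {
  while ( p->ref_count.load() > 0 ) {
    task* t = task_queue[tid].dequeue();                 // own queue first
    if ( !t ) t = task_queue[choose_victim()].steal();   // otherwise steal
    if ( t ) run( t );
  }
}

struct FibTask : task {
  int n; long* sum;
  FibTask( int n_, long* sum_, task* parent_ ) : n( n_ ), sum( sum_ ) { parent = parent_; }
  void execute() override {
    if ( n < 2 ) { *sum = n; return; }
    long x = 0, y = 0;
    FibTask a( n - 1, &x, this ), b( n - 2, &y, this );
    ref_count.store( 2 );
    task::spawn( &a );
    task::spawn( &b );
    task::wait( this );
    *sum = x + y;
  }
};

long parallel_fib( int n ) {
  assert( n >= 0 );
  std::atomic<bool> done{ false };
  std::vector<std::thread> workers;
  for ( int i = 1; i < kThreads; ++i )
    workers.emplace_back( [i, &done] {
      tid = i;
      while ( !done.load() ) {       // idle workers keep trying to steal
        task* t = task_queue[choose_victim()].steal();
        if ( t ) run( t ); else std::this_thread::yield();
      }
    } );
  long result = 0;
  FibTask root( n, &result, nullptr );
  root.execute();                    // the calling thread acts as worker 0
  done.store( true );
  for ( auto& w : workers ) w.join();
  return result;
}
```

`parallel_fib( n )` returns the n-th Fibonacci number. The sequentially consistent atomics make a child's result visible to its parent once the parent observes `ref_count == 0`, which is the guarantee that hardware coherence provides for free and that software-centric coherence has to establish explicitly.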
[Animation: work-stealing walkthrough on four cores (Core 0 to Core 3), each with work in progress and a task queue, shown alongside the task::wait listing. Events in order: Task A runs; Task B is spawned; Task B is dequeued; Task C is spawned; Task D is spawned; Tasks C and D are stolen; Tasks E and F are spawned; Tasks E and F are stolen. The resulting task graph contains Tasks A to F.]
[1] R. D. Blumofe, M. Frigo, C. F. Joerg, C. E. Leiserson, and K. H.
[Animation: the task::wait listing is annotated step by step until it carries seven calls:]

    cache_invalidate(); cache_flush(); cache_invalidate(); cache_flush(); cache_invalidate(); cache_flush(); cache_invalidate();
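The extracted slide text lists the seven calls but not where they sit inside `task::wait`. The sketch below shows one plausible placement, which is an assumption and not taken from the slides: invalidate after each lock acquire, flush before each lock release, invalidate and flush around a stolen task, and a final invalidate before returning to the parent. `cache_invalidate` and `cache_flush` are counting stubs, `amo_sub` is a plain subtraction, and the whole block is a single-threaded model meant only to show the structure.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// Counting stubs standing in for the real cache operations.
int g_invalidates = 0;
int g_flushes     = 0;
void cache_invalidate() { ++g_invalidates; }
void cache_flush()      { ++g_flushes; }

struct task {
  task* parent    = nullptr;
  int   ref_count = 0;
  virtual void execute() {}
  virtual ~task() = default;
  static void wait( task* p );
};

struct TaskQueue {
  std::mutex        m;
  std::deque<task*> q;
  void  lock_acquire() { m.lock(); }
  void  lock_release() { m.unlock(); }
  task* dequeue() { if ( q.empty() ) return nullptr; task* t = q.back();  q.pop_back();  return t; }
  task* steal()   { if ( q.empty() ) return nullptr; task* t = q.front(); q.pop_front(); return t; }
};

constexpr int kCores = 2;                 // hypothetical core count
TaskQueue task_queue[kCores];
int  tid = 0;
int  choose_victim() { return ( tid + 1 ) % kCores; }
void amo_sub( int& counter, int v ) { counter -= v; }  // stands in for an atomic subtract

void task::wait( task* p ) {
  while ( p->ref_count > 0 ) {
    task_queue[tid].lock_acquire();
    cache_invalidate();                   // (1) observe the latest queue state
    task* t = task_queue[tid].dequeue();
    cache_flush();                        // (2) publish the queue update
    task_queue[tid].lock_release();
    if ( t ) {
      t->execute();
      amo_sub( t->parent->ref_count, 1 );
    } else {
      int vid = choose_victim();
      task_queue[vid].lock_acquire();
      cache_invalidate();                 // (3) observe the victim's queue
      t = task_queue[vid].steal();
      cache_flush();                      // (4) publish the queue update
      task_queue[vid].lock_release();
      if ( t ) {
        cache_invalidate();               // (5) read fresh inputs of the stolen task
        t->execute();
        cache_flush();                    // (6) make its results visible
        amo_sub( t->parent->ref_count, 1 );
      }
    }
  }
  cache_invalidate();                     // (7) read the children's results
}
```

Counting the call sites gives four `cache_invalidate()` and three `cache_flush()` calls, which matches the seven calls listed on the slide. That match is the only evidence for this placement.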
[Animation: the calls annotating the task::wait listing are removed step by step, from seven calls to three and then to two:]

    cache_invalidate(); cache_flush();
[Figure: big.TINY chip diagram annotated with three steps:]

  1. Send Interrupt
  2. Jump to a handler
  3. Send an ACK/NACK

[1] T. von Eicken, D. Culler, S. Goldstein, and K. Schauser. Active Messages: A Mechanism for Integrated Communication and Computation. ISCA 1992.
[2] D. Sanchez, R. M. Yoo, and C. Kozyrakis. Flexible Architectural Support for Fine-Grain Scheduling. ASPLOS 2010.
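The three steps (send an interrupt, jump to a handler, send an ACK/NACK) can be modeled in software to show the shape of the handshake. This is a single-threaded sketch under stated assumptions, not the hardware mechanism from the talk: `Core`, `uli_send`, `uli_handler`, `try_steal`, the `mailbox` slot, and the rule that a core with interrupts disabled answers NACK are all hypothetical.

```cpp
#include <cassert>
#include <deque>

constexpr int kNoTask = -1;

struct Core {
  std::deque<int> task_queue;          // private queue of task ids
  bool            uli_enabled = true;  // whether the core accepts interrupts
  int             mailbox = kNoTask;   // slot where a victim leaves a stolen task
};

enum class Reply { ACK, NACK };

// Step 2: the handler runs on the victim and touches only the victim's
// own queue, so no lock on the queue is needed in this model.
Reply uli_handler( Core& victim, Core& thief ) {
  if ( victim.task_queue.empty() ) return Reply::NACK;
  thief.mailbox = victim.task_queue.front();
  victim.task_queue.pop_front();
  return Reply::ACK;
}

// Steps 1 and 3: deliver the interrupt and return the victim's reply.
// Assumption: a victim with interrupts disabled answers NACK immediately.
Reply uli_send( Core& thief, Core& victim ) {
  if ( !victim.uli_enabled ) return Reply::NACK;
  return uli_handler( victim, thief );
}

// Thief side: on ACK, take the task out of the mailbox.
int try_steal( Core& thief, Core& victim ) {
  assert( &thief != &victim );
  if ( uli_send( thief, victim ) == Reply::NACK ) return kNoTask;
  int t = thief.mailbox;
  thief.mailbox = kNoTask;
  return t;
}
```

In this model a steal attempt either returns a task id or `kNoTask`, and the victim's queue is only ever modified by code running on the victim.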
[Animation: a thief core steals from a victim core. Labels: Parent Task, Child Task, Work In-Progress, Task Queues (Private), Shared LLC, ULI with id, Flush. Steps shown: Send Interrupt, then Jump to a handler.]
Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation
A big.TINY architecture combines a few big OOO cores with many tiny IO cores on a single die using heterogeneous cache coherence
Parent Task
Victim Thief Work In-Progress Task Queues (Private) Shared LLC
Child Task
Flush ULI with id
tiny
L1s
DIR
L2
DIR
L2 L1s
R
L1h L1s MC
DIR
L2 L1s L1s L1s
DIR
L2 L1s L1h L1s
DIR
L2 L1s L1s L1s
DIR
L2 L1s L1h L1s
DIR
L2 L1s L1s L1s
DIR
L2 L1s L1h L1s L1s L1s MC MC MC MC MC MC MC
R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R
... ... ... ... ... ... ...
tiny tiny tiny tiny tiny tiny tiny tiny tiny tiny tiny tiny tiny tiny tiny big tiny big tiny big tiny big tiny
Page 16 of 26
Send Interrupt Jump to a handler
Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation
A big.TINY architecture combines a few big OOO cores with many tiny IO cores on a single die using heterogeneous cache coherence
Parent Task
Victim Thief Work In-Progress Task Queues (Private) Shared LLC
Child Task
Flush
ULI with id
tiny
L1s
DIR
L2
DIR
L2 L1s
R
L1h L1s MC
DIR
L2 L1s L1s L1s
DIR
L2 L1s L1h L1s
DIR
L2 L1s L1s L1s
DIR
L2 L1s L1h L1s
DIR
L2 L1s L1s L1s
DIR
L2 L1s L1h L1s L1s L1s MC MC MC MC MC MC MC
R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R
... ... ... ... ... ... ...
tiny tiny tiny tiny tiny tiny tiny tiny tiny tiny tiny tiny tiny tiny tiny big tiny big tiny big tiny big tiny
Page 16 of 26
Send Interrupt Jump to a handler Send an ACK/NACK
Motivation • Background • Implementing Work-Stealing on HCC • DTS • Evaluation
A big.TINY architecture combines a few big OOO cores with many tiny IO cores on a single die using heterogeneous cache coherence
Parent Task
Victim Thief Work In-Progress Task Queues (Private) Shared LLC
Child Task
Flush
ULI with id ACK
tiny
L1s
DIR
L2
DIR
L2 L1s
R
L1h L1s MC
DIR
L2 L1s L1s L1s
DIR
L2 L1s L1h L1s
DIR
L2 L1s L1s L1s
DIR
L2 L1s L1h L1s
DIR
L2 L1s L1s L1s
DIR
L2 L1s L1h L1s L1s L1s MC MC MC MC MC MC MC
R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R
... ... ... ... ... ... ...
tiny tiny tiny tiny tiny tiny tiny tiny tiny tiny tiny tiny tiny tiny tiny big tiny big tiny big tiny big tiny
Page 16 of 26
Send Interrupt Jump to a handler Send an ACK/NACK
Page 16 of 26
[Figure: DTS steal sequence between a Thief and a Victim, drawn over the big.TINY diagram. Labels: Parent Task, Child Task, Work In-Progress, Task Queues (Private), Shared LLC.]
Steps, in the order they appear on the slide:
1. Send Interrupt (ULI with id)
2. Jump to a handler
3. Flush
4. Send an ACK/NACK (ACK)
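The four steps on this slide can be sketched as a single-threaded C++ model. This is only an illustration under stated assumptions: `Core`, `StealReply`, `handle_uli`, and `flush_count` are hypothetical names, the interrupt is modeled as a direct function call, and the flush is modeled as a counter. Real DTS relies on hardware user-level interrupts and cache flush instructions, which are not reproduced here. Stealing the oldest task from the head of the queue is also an assumption.

```cpp
#include <cassert>
#include <deque>
#include <optional>

// Hypothetical task record: just an id for illustration.
struct Task { int id; };

// Reply the victim sends back to the thief.
struct StealReply {
    bool ack;                  // true = ACK (task handed over), false = NACK
    std::optional<Task> task;  // set only on ACK
};

// Hypothetical model of one core with a private task queue.
struct Core {
    int id = 0;
    std::deque<Task> queue;    // private: only the owner touches it
    int flush_count = 0;       // counts modeled cache flushes

    // Models the handler the victim jumps to when the ULI arrives.
    // thief_id stands in for the "ULI with id" payload.
    StealReply handle_uli(int thief_id) {
        (void)thief_id;                      // unused in this sketch
        if (queue.empty()) {
            return {false, std::nullopt};    // nothing to give: NACK
        }
        Task t = queue.front();              // assumed: hand over the oldest task
        queue.pop_front();
        ++flush_count;                       // flush so the thief can see the task's data
        return {true, t};                    // ACK
    }
};

// Thief side: "send interrupt" is modeled as a direct call on the victim.
std::optional<Task> steal_using_dts(Core& thief, Core& victim) {
    StealReply reply = victim.handle_uli(thief.id);
    if (reply.ack) {
        return reply.task;
    }
    return std::nullopt;
}
```

In this model a flush happens only when a steal succeeds, and the thief never reads the victim's queue directly, which is why the queues can stay private.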
Page 17 of 26
void task::wait( task* p ) {
  // Run or steal tasks until all of p's children have finished.
  while ( p->ref_count > 0 ) {
    task* t = task_queue[tid].dequeue();
    if (t) {
      t->execute();
      // An atomic decrement is needed only if a child of the parent was stolen.
      if (t->parent->has_stolen_child)
        amo_sub( t->parent->ref_count, 1 );
      else
        t->parent->ref_count -= 1;
    } else {
      // Local queue is empty: try to steal with DTS.
      t = steal_using_dts();
      if (t) {
        t->execute();
        amo_sub( t->parent->ref_count, 1 );
      }
    }
  }
  if ( p->has_stolen_child )
    cache_invalidate();
}
Calls highlighted on the slide (their exact positions in the code are not recoverable from the extracted text): cache_invalidate(); cache_flush(); uli_disable(); uli_enable();
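To make the wait loop concrete, here is a minimal single-threaded mock that compiles and runs on an ordinary machine. The hardware operations are replaced by counting stubs, `task_wait`, `dequeue`, and the `g_` counters are names invented for this sketch, and the positions of `uli_disable()`/`uli_enable()` and of `cache_invalidate()`/`cache_flush()` around stolen work are assumptions, since the slide shows those calls only as overlays.

```cpp
#include <cassert>
#include <deque>

// Counting stubs standing in for hardware operations (hypothetical).
static int g_invalidates = 0, g_flushes = 0, g_amos = 0, g_uli_masked = 0;
void cache_invalidate() { ++g_invalidates; }
void cache_flush()      { ++g_flushes; }
void uli_disable()      { ++g_uli_masked; }
void uli_enable()       { --g_uli_masked; }
void amo_sub(int& x, int v) { x -= v; ++g_amos; }  // models an atomic subtract

struct Task {
    Task* parent = nullptr;
    int ref_count = 0;              // number of unfinished children
    bool has_stolen_child = false;  // would be set by the victim's ULI handler
    bool executed = false;
    void execute() { executed = true; }
};

static std::deque<Task*> g_queue;   // this core's private task queue

Task* dequeue() {
    if (g_queue.empty()) return nullptr;
    Task* t = g_queue.back();
    g_queue.pop_back();
    return t;
}

// Stub: in this single-core mock there is nobody to steal from.
Task* steal_using_dts() { return nullptr; }

void task_wait(Task* p) {
    while (p->ref_count > 0) {
        uli_disable();              // assumed: no steal handler may run mid-dequeue
        Task* t = dequeue();
        uli_enable();
        if (t) {
            t->execute();
            if (t->parent->has_stolen_child)
                amo_sub(t->parent->ref_count, 1);
            else
                t->parent->ref_count -= 1;  // no sharing, plain decrement
        } else {
            t = steal_using_dts();
            if (t) {
                cache_invalidate(); // assumed: drop stale data before stolen work
                t->execute();
                cache_flush();      // assumed: publish results afterwards
                amo_sub(t->parent->ref_count, 1);
            }
        }
    }
    if (p->has_stolen_child) cache_invalidate();
}
```

With two children queued locally and no steal, the loop finishes without a single invalidate, flush, or atomic operation. The mock only terminates when every outstanding child is in the local queue, because its steal stub never returns work.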
Page 21 of 26
Speedup over Serial IO

Name         Input        O3×1    O3×4    O3×8  b.T/MESI
cilk5-cs     3000000      1.65    4.92    9.78     18.70
cilk5-lu     128          2.48    9.46   17.24     23.93
cilk5-mm     256         11.38   11.76   22.04     41.23
cilk5-mt     8000         5.71   19.70   39.94     57.43
cilk5-nq     10           1.57    3.87    7.03      2.93
ligra-bc     rMat_100K    2.05    6.29   13.06     11.48
ligra-bf     rMat_200K    1.80    5.36   11.25     12.80
ligra-bfs    rMat_800K    2.23    6.23   12.70     15.63
ligra-bfsbv  rMat_500K    1.91    6.17   12.25     14.42
ligra-cc     rMat_500K    3.00    9.11   20.66     24.12
ligra-mis    rMat_100K    2.43    7.70   15.61     19.01
ligra-radii  rMat_200K    2.80    8.17   17.89     25.94
ligra-tc     rMat_200K    1.49    4.99   10.89     23.21
geomean                   2.56    7.26   14.70     16.94
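The geomean row summarizes each column with a geometric mean. As a small check, the sketch below recomputes it for the O3×1 column from the thirteen per-benchmark speedups; `geomean` and `kO3x1` are names chosen here, not taken from the slides.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean: exponential of the arithmetic mean of the logarithms.
double geomean(const std::vector<double>& xs) {
    double log_sum = 0.0;
    for (double x : xs) {
        log_sum += std::log(x);
    }
    return std::exp(log_sum / static_cast<double>(xs.size()));
}

// O3x1 speedups from the table, in row order (cilk5-cs through ligra-tc).
const std::vector<double> kO3x1 = {
    1.65, 2.48, 11.38, 5.71, 1.57, 2.05, 1.80,
    2.23, 1.91, 3.00, 2.43, 2.80, 1.49};
```

Calling `geomean(kO3x1)` gives a value that rounds to the 2.56 reported in the table. The geometric mean is the usual choice for speedup ratios because a single large outlier such as cilk5-mm moves it far less than it would move an arithmetic mean.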
Page 23 of 26
[Figure: evaluation chart, not recoverable from the extracted text. Surviving legend text: "Big cores always use MESI, tiny cores use:" with the entry big.TINY/MESI.]
Page 24 of 26
             Invalidation Decrease (%)   Flush Decrease (%)   Hit Rate Increase (%)
App            dnv     gwt     gwb              gwb             dnv     gwt     gwb
cilk5-cs     99.42   99.28   99.50            98.86            1.80    2.45    1.30
cilk5-lu     98.83   99.78   99.53            98.40            1.12    7.12    2.94
cilk5-mm     99.22   99.67   99.62            99.12           30.03   42.19   36.80
cilk5-mt     99.88   99.73   99.93            99.82           12.45    2.70    6.56
cilk5-nq     97.74   97.88   98.32            95.84           16.84   28.87   27.04
ligra-bc     94.89   97.04   97.33            93.80            7.64   21.43   14.99
ligra-bf     29.02   38.14   40.24            21.63            7.22   17.14   11.17
ligra-bfs    94.18   95.85   95.90            91.23            3.48   15.76    8.00
ligra-bfsbv  39.31   47.36   50.74            29.46            3.10   12.65    7.56
ligra-cc     98.03   98.17   98.16            95.89            3.11   11.11    6.17
ligra-mis    97.35   98.28   98.36            96.16            5.62   16.29   11.10
ligra-radii  95.97   98.17   98.19            95.75            3.62   11.00    7.03
ligra-tc     10.83   15.99   17.02             7.52            1.59    3.55    3.02
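The slide does not spell out how the "Decrease (%)" columns are computed. One plausible reading, stated here as an assumption, is the relative drop in an event count compared with a baseline run. The helper below, with the hypothetical name `percent_decrease`, shows that arithmetic.

```cpp
#include <cassert>
#include <cmath>

// Percent decrease of `reduced` relative to `baseline` (assumed definition).
// Example: a baseline of 1000 events dropping to 25 is a 97.5% decrease.
double percent_decrease(double baseline, double reduced) {
    return (baseline - reduced) / baseline * 100.0;
}
```

The counts in the comment are made up for illustration and do not come from the evaluation.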
Page 25 of 26
[Figure: evaluation chart, not recoverable from the extracted text. Surviving text: "Big cores always use MESI, tiny cores use:" and the label "flushes".]
Page 26 of 26
This work was supported in part by the Center for Applications Driving Architectures (ADA), one of six centers of JUMP, a Semiconductor Research Corporation program cosponsored by DARPA, and equipment donations from Intel.