

SLIDE 1

Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator

By: Aahlad Chandrabhatta Varsheeth Talluri

SLIDE 2

Outline

  • Motivation
  • Desirables
  • Chip Organization
  • Five Design Elements & Implementation
  • Benchmarks
  • Evaluation
  • Conclusion
  • Discussion Points


*All images in the presentation are taken from the original paper

SLIDE 3

Motivation

  • Accelerators
    ○ Maximize throughput (throughput/area & throughput/watt)
    ○ Domain specific (limited programmability)
    ○ Special-purpose memory hierarchies & functional units
  • General-purpose processors
    ○ Attempt to minimize latency
    ○ Generic (extensive programmability)

SLIDE 4

Motivation

Issue? While restricting the programming model yields high performance for data-parallel applications with regular computation and memory-access patterns, it presents a difficult target for less regular applications. In other words, we need programmable accelerators.

SLIDE 5

Desirables

  • A programmable accelerator that provides performance through large-scale parallel execution
  • Reduce the semantic gap between the low-level programming interface (LPI) and traditional programming languages
  • The LPI needs to include primitive operations for expressing and managing parallelism
  • The LPI should also provide an effective way to exploit the accelerator’s compute throughput

SLIDE 6

Chip Organization

Objective: Support high throughput while not compromising on programming model.

  • Core: single-precision FP unit
  • Cluster: group of 8 cores
  • Cluster cache: common cache for all cores in a cluster
  • Tile: group of 16 clusters (128 cores)
  • Global cache banks: connected to all tiles
  • With 45 nm technology, 1024 cores fit onto a 320 mm² chip (see the sketch below)
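As a rough sketch in C of how these counts compose (the 8-tile figure is derived from 1024 total cores divided by 128 cores per tile; it is not stated on this slide):

    /* Rigel chip hierarchy as described on this slide. */
    #define CORES_PER_CLUSTER  8
    #define CLUSTERS_PER_TILE  16
    #define CORES_PER_TILE     (CORES_PER_CLUSTER * CLUSTERS_PER_TILE)  /* 128 */
    #define TILES_PER_CHIP     8   /* derived: 1024 cores / 128 cores per tile */
    #define CORES_PER_CHIP     (CORES_PER_TILE * TILES_PER_CHIP)        /* 1024 */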

SLIDE 7

Design Elements

1. Execution Model
2. Memory Model
3. Work Distribution
4. Coherency
5. Locality Management

SLIDE 8

Design Elements - (1/5) - Execution Model

  • SPMD, because SIMD imposes undue optimization costs on many irregular applications.
  • RISC, because the goal is an efficient accelerator with a small ISA.
  • BSP (Bulk Synchronous Parallel) execution model:
    ○ Execute parallel jobs (tasks)
    ○ Communication between jobs
    ○ Logical (memory) barriers: a memory barrier forces all outstanding memory operations from a cluster to complete before allowing any memory operation after the barrier to begin.
  • The execution model is implemented using task queues (a barrier sketch follows).
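A minimal sketch of the barrier semantics above, in C; rigel_mem_barrier() is a hypothetical intrinsic standing in for the logical memory barrier, not an actual Rigel mnemonic:

    #define N 64
    extern void rigel_mem_barrier(void);  /* hypothetical: all outstanding memory
                                             ops from this cluster complete first */
    static int compute(int i) { return i * i; }  /* stand-in for task work */

    volatile int data[N];
    volatile int ready = 0;

    void producer(void) {
        for (int i = 0; i < N; i++)
            data[i] = compute(i);
        rigel_mem_barrier();  /* writes to data[] complete before... */
        ready = 1;            /* ...the flag becomes visible to consumers */
    }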

SLIDE 9

Execution Model - Visual Structure of Model

[Figure: General BSP Model vs. Rigel’s BSP Model]

SLIDE 10

Execution Model - Queue Management Instructions

Four main operations on task queues are supported:

  • TQ_CREATE - Creates a new Task Queue.
  • TQ_ENQUEUE_GROUP - Enqueues a group of tasks to the Task Queue.
  • TQ_DEQUEUE - Dequeues one task from the Task Queue.
  • TQ_ENQUEUE - Enqueues one task to the Task Queue.

The interface also provides atomic primitives so that these enqueue and dequeue operations are free of race conditions.
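A hypothetical C binding for these four operations, plus the SPMD worker loop the execution model implies; the signatures are illustrative, not the paper’s actual interface:

    typedef struct { void (*fn)(void *); void *arg; } Task;
    typedef int TQ;  /* opaque task-queue handle */

    TQ   TQ_CREATE(void);
    void TQ_ENQUEUE(TQ q, Task t);
    void TQ_ENQUEUE_GROUP(TQ q, const Task *tasks, int n);
    int  TQ_DEQUEUE(TQ q, Task *out);  /* returns 0 when the interval ends */

    /* Every core runs the same loop (SPMD): pull a task, run it, repeat. */
    void worker(TQ q) {
        Task t;
        while (TQ_DEQUEUE(q, &t))
            t.fn(t.arg);
    }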

SLIDE 11

Design Elements - (2/5) - Memory Model

  • Single global address space: all cores in the Rigel processor share a single address space.
  • Hierarchical memory model:
    ○ Every cluster has a local cluster cache for local operations.
    ○ All cores (and clusters and tiles) share a global cache for global operations.
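A small sketch of the local/global split; global_load() and global_store() are hypothetical intrinsics standing in for global memory operations:

    extern int  global_load(int *addr);            /* hypothetical intrinsic */
    extern void global_store(int *addr, int val);  /* hypothetical intrinsic */

    int partial[64];  /* shared only within a cluster */
    int flag;         /* shared across the whole chip */

    void example(int i) {
        int x = partial[i];          /* ordinary load: completes at the cluster cache */
        if (global_load(&flag) == 0)
            global_store(&flag, x);  /* completes at the global cache,
                                        visible to all clusters */
    }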

SLIDE 12

Design Elements - (3/5) - Work Distribution

Key concept: Task Queue. All tasks (local or global) that have to be handled are placed in task queues.
Key concept: Task Group. A set of tasks that execute on a single Rigel cluster.

  • Hierarchical task queues: global and local task queues.
  • Parallel regions are divided into parallel tasks by the programmer, and task groups are formed.
  • The LPI provides mechanisms for distributing tasks across parallel resources with minimal overhead (see the sketch below).
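A hypothetical sketch of hierarchical distribution, reusing the TQ_* binding sketched earlier: the programmer splits an iteration space into tasks and enqueues them as groups, so each group lands on one cluster. GROUP_SIZE and do_iter() are illustrative names:

    #define GROUP_SIZE 16
    extern void do_iter(void *arg);  /* stand-in for one task's work */

    void distribute(TQ global_q, int n_iters) {
        Task group[GROUP_SIZE];
        for (int base = 0; base < n_iters; base += GROUP_SIZE) {
            int n = 0;
            for (int i = base; i < base + GROUP_SIZE && i < n_iters; i++)
                group[n++] = (Task){ do_iter, (void *)(long)i };
            TQ_ENQUEUE_GROUP(global_q, group, n);  /* one task group per cluster */
        }
    }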

SLIDE 13

Design Elements - (4/5) - Coherency

  • Local coherence (within a cluster) is achieved by the cluster cache.
  • Global coherence (for global operations) is achieved by the global cache.
  • Coherence across clusters (for read-write sharing) is achieved through software-enforced solutions:
    ○ Store and read shared data at the global cache every time, instead of the cluster cache.
    ○ Force the writer to explicitly flush shared data before allowing read access (so the global cache gets updated); instructions for broadcast-invalidate and broadcast-update operations are provided.
    ○ Note that both solutions are expensive, since they involve the global cache.
  • Ordering between local and global operations on a single core can be enforced using explicit memory barrier operations. Key concept revisited, logical (memory) barriers: a memory barrier forces all outstanding memory operations from a cluster to complete before allowing any memory operation after the barrier to begin. (A flush-and-publish sketch follows.)
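A minimal flush-and-publish sketch of the software-enforced coherence above, assuming the hypothetical intrinsics from earlier slides plus cache_flush(), an illustrative writeback of a cluster-cache line to the global cache:

    extern void cache_flush(void *addr);       /* hypothetical writeback intrinsic */
    extern void rigel_mem_barrier(void);       /* hypothetical memory barrier */
    extern void global_store(int *addr, int);  /* hypothetical global store */

    void publish(int *shared, int n, int *ready_flag) {
        for (int i = 0; i < n; i++)
            cache_flush(&shared[i]);  /* push dirty data to the global cache */
        rigel_mem_barrier();          /* flushes complete before the flag write */
        global_store(ready_flag, 1);  /* readers may now fetch shared[] globally */
    }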

SLIDE 14

Coherency - Visual Structure of Model

[Figure: visual structure of the coherence model]
SLIDE 15

Design Elements - (5/5) - Locality Management

  • Co-location of tasks onto processing resources increases local data sharing and reduces the latency and frequency of communication and synchronization among co-located tasks.
  • Implicitly handled by hardware-managed caches that exploit temporal and spatial locality.
  • Explicitly handled by programmers via cache-management instructions (see the sketch below).
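A hypothetical example of explicit cache management: prefetch a block into the cluster cache before computing on it. prefetch_line() and LINE_WORDS are illustrative, not actual Rigel names:

    #define LINE_WORDS 8
    extern void prefetch_line(const void *addr);  /* hypothetical intrinsic */

    void scale_block(float *buf, int n) {
        for (int i = 0; i < n; i += LINE_WORDS)
            prefetch_line(&buf[i]);  /* warm the cluster cache */
        for (int i = 0; i < n; i++)
            buf[i] *= 2.0f;          /* compute now hits the warm cache */
    }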

SLIDE 16

Benchmarks

  • Conjugate Gradient Linear Solver (cg)
  • Gilbert-Johnson-Keerthi Collision Detection (gjk)
  • Heat Transfer Simulation (heat)
  • K Means Clustering (kmeans)
  • Dense Matrix Multiplication (dmm)
  • Medical Image Reconstruction Kernel (mri)

Only the justifications for using cg, heat, and mri were described in the paper.

SLIDE 17

Benchmarks

[Figure: speedup over a 1-cluster system vs. number of clusters]

SLIDE 18

Benchmarks - Conjugate Gradient Linear Solver

Description: the algorithm uses a sparse matrix-vector multiply (SMVM) constituting 85% of the sequential execution time. Each element in the large, read-only data array is accessed only once per iteration while performing the SMVM.

Motivation for choosing this benchmark:

  • Vectors generated each iteration are shared by cores within a cluster.
  • Vector modifications are exchanged each iteration through the global cache.
  • A prefetch operation in the Rigel ISA allows data to bypass the global cache, avoiding polluting it with touch-once data not shared across clusters (see the sketch below).
  • Rigel achieves enqueue efficiency good enough to satisfy the benchmark’s high task-input rates.
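A sketch of the SMVM kernel with a hypothetical global-cache-bypassing prefetch, prefetch_nga() ("no global allocate"); the CSR layout and intrinsic name are illustrative:

    extern void prefetch_nga(const void *addr);  /* hypothetical: bypass global cache */

    void smvm(int nrows, const int *rowptr, const int *col,
              const float *val, const float *x, float *y) {
        for (int r = 0; r < nrows; r++) {
            /* Matrix data is touch-once: prefetch it past the global cache. */
            for (int k = rowptr[r]; k < rowptr[r + 1]; k += 8)
                prefetch_nga(&val[k]);
            float acc = 0.0f;
            for (int k = rowptr[r]; k < rowptr[r + 1]; k++)
                acc += val[k] * x[col[k]];  /* x is reused within the cluster */
            y[r] = acc;
        }
    }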

SLIDE 19

Benchmarks - K Means Clustering

Description: implements the k-means clustering algorithm, in which n-dimensional vectors are partitioned into K bins so that the aggregate distance to the bin centers is minimized.

Motivation for choosing this benchmark:

  • Performs efficient atomic operations at the global cache instead of a global reduction at the end of the parallel section (see the sketch below).
  • Due to the benchmark’s high arithmetic intensity and high reuse in cluster caches, the increased global-cache traffic does not adversely impact performance.
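A sketch of accumulating cluster statistics with atomics that complete at the global cache, rather than a per-core reduction at the end of the parallel section; atomic_add_global_f() and atomic_add_global_i() are hypothetical intrinsics:

    extern void atomic_add_global_f(float *addr, float val);  /* hypothetical */
    extern void atomic_add_global_i(int *addr, int val);      /* hypothetical */

    /* Fold one vector into its assigned bin's running centroid sums. */
    void accumulate(const float *v, int dim, int bin,
                    float *centroid_sum, int *centroid_count) {
        for (int d = 0; d < dim; d++)
            atomic_add_global_f(&centroid_sum[bin * dim + d], v[d]);
        atomic_add_global_i(&centroid_count[bin], 1);
    }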

SLIDE 20

Benchmarks - Dense Matrix Multiply

Description: dense matrix multiply has a very regular data-access pattern with high arithmetic intensity.

Motivation for choosing this benchmark:

  • Exploits Rigel’s ability to make effective use of locality management, prefetching, cluster-cache management, global-cache staging, and added synchronization.
  • Exploits Rigel’s ability to support applications amenable to static partitioning (see the sketch below).
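A sketch of the static partitioning this benchmark relies on: each cluster is assigned a fixed block of output rows. cluster_id() and NUM_CLUSTERS are hypothetical names, and n is assumed divisible by NUM_CLUSTERS:

    #define NUM_CLUSTERS 128
    extern int cluster_id(void);  /* hypothetical: this cluster's index */

    /* C = A * B, all n x n, row-major; each cluster computes its own rows. */
    void dmm_partition(int n, const float *A, const float *B, float *C) {
        int rows = n / NUM_CLUSTERS;
        int r0 = cluster_id() * rows;
        for (int i = r0; i < r0 + rows; i++)
            for (int j = 0; j < n; j++) {
                float acc = 0.0f;
                for (int k = 0; k < n; k++)
                    acc += A[i * n + k] * B[k * n + j];
                C[i * n + j] = acc;
            }
    }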

SLIDE 21

Evaluation

[Figure: Rigel vs. GPU comparison]

SLIDE 22

Conclusion

  • Rigel can achieve a compute density of over 8 single-precision GFLOPS/mm² in 45 nm, with a more flexible programming interface than conventional accelerators.
  • It is important to support fast task enqueue and dequeue operations and barriers, and both can be implemented with a minimalist approach to specialized hardware.

SLIDE 26

Discussion Points

  • Inter-cluster communication must always go through the global cache, yet we know these operations are very expensive. Is it justifiable to call the accelerator described in the paper a programmable accelerator if it cannot efficiently handle workloads that involve such inter-cluster operations?
  • The authors do not say anything about optimal barrier placement. From a programmer’s perspective, does having to work out barrier placement in such a heavily parallel environment increase complexity?
  • Is it worth the effort to build programmable accelerators useful for a variety of parallelizable applications, rather than specific ASICs for every domain?

SLIDE 27

End
