1. Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator
By: Aahlad Chandrabhatta, Varsheeth Talluri

2. Outline
● Motivation
● Desirables
● Chip Organization
● Five Design Elements & Implementation
● Benchmarks
● Evaluation
● Conclusion
● Discussion Points
*All images in the presentation are taken from the original paper

3. Motivation
● Accelerators
  ○ Maximize throughput (throughput/area & throughput/watt)
  ○ Domain specific (limited programmability)
  ○ Special-purpose memory hierarchies & functional units
● General-purpose processors
  ○ Attempt to minimize latency
  ○ Generic (extensive programmability)

4. Motivation - Issue?
While restricting the programming model yields high performance for data-parallel applications with regular computation and memory access patterns, it presents a difficult target for less regular applications. In other words, we need programmable accelerators.

5. Desirables
A programmable accelerator that provides performance through large-scale parallel execution:
● Reduce the semantic gap between the low-level programming interface (LPI) and traditional programming languages
● The LPI needs to include primitive operations for expressing and managing parallelism
● The LPI should also provide an effective way to exploit the accelerator's compute throughput

6. Chip Organization
Objective: support high throughput without compromising the programming model.
● Core - includes a single-precision FP unit
● Cluster - group of 8 cores
● Cluster Cache - cache shared by all cores in a cluster
● Tile - group of 16 clusters (128 cores)
● Global Cache bank - connected to all tiles
● With 45nm technology, 1024 cores fit onto a 320 mm² chip

7. Design Elements
1. Execution Model
2. Memory Model
3. Work Distribution
4. Coherency
5. Locality Management

8. Design Elements - (1/5) - Execution Model
● SPMD - because SIMD imposes undue optimization costs for many irregular applications.
● RISC - because the goal is to make an efficient accelerator with a small ISA.
● BSP - Bulk Synchronous Parallel execution model:
  ○ Execute parallel jobs (tasks)
  ○ Communication between jobs
  ○ Logical (memory) barriers - a memory barrier forces all outstanding memory operations from a cluster to complete before allowing any memory operations after the barrier to begin.
● The execution model is implemented using task queues.
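The BSP style above can be modeled abstractly: workers drain a shared task queue, then all meet at a barrier before the next interval begins. The sketch below is a minimal Python model of these semantics (the names `worker`, `tasks`, and the squaring "work" are illustrative, not Rigel's actual ISA or runtime):

```python
from queue import Empty, Queue
from threading import Barrier, Thread

NUM_WORKERS = 4

def worker(tasks, barrier, results):
    # Drain the task queue for this BSP interval...
    while True:
        try:
            task = tasks.get_nowait()
        except Empty:
            break  # queue empty: this worker's interval is done
        results.append(task * task)  # stand-in for real per-task work
    # ...then synchronize: no worker proceeds until all have arrived,
    # playing the role of Rigel's logical (memory) barrier.
    barrier.wait()

tasks = Queue()
for i in range(8):
    tasks.put(i)

results = []
barrier = Barrier(NUM_WORKERS)
threads = [Thread(target=worker, args=(tasks, barrier, results))
           for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# After the barrier, every task's result is globally visible.
```

After `join`, `results` holds the squares of 0..7 in some interleaved order, mirroring how all writes from a BSP interval become visible only after the barrier completes.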

9. Execution Model - Visual Structure of Model
[Figure: General BSP Model vs. Rigel's BSP Model]

10. Execution Model - Queue Management Instructions
Four main operations on task queues are supported:
● TQ_CREATE - creates a new task queue.
● TQ_ENQUEUE_GROUP - enqueues a group of tasks onto the task queue.
● TQ_DEQUEUE - dequeues one task from the task queue.
● TQ_ENQUEUE - enqueues one task onto the task queue.
The interface also provides atomic primitives so that these operations are free of race conditions when enqueuing or dequeuing.
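A minimal Python model of these four operations, assuming a lock stands in for the hardware atomic primitives (the method names mirror the TQ_* instructions; the real operations are hardware/runtime primitives, not Python methods):

```python
import threading
from collections import deque

class TaskQueue:
    """Illustrative model of Rigel's task-queue operations."""

    def __init__(self):                 # TQ_CREATE
        self._q = deque()
        self._lock = threading.Lock()   # stands in for the atomic primitives

    def enqueue(self, task):            # TQ_ENQUEUE: add one task
        with self._lock:
            self._q.append(task)

    def enqueue_group(self, tasks):     # TQ_ENQUEUE_GROUP: add a group at once
        with self._lock:
            self._q.extend(tasks)

    def dequeue(self):                  # TQ_DEQUEUE: None signals "queue empty"
        with self._lock:
            return self._q.popleft() if self._q else None

tq = TaskQueue()                        # TQ_CREATE
tq.enqueue_group(range(3))              # TQ_ENQUEUE_GROUP
tq.enqueue(99)                          # TQ_ENQUEUE
first = tq.dequeue()                    # TQ_DEQUEUE
```

Holding one lock for all operations captures the race-freedom guarantee from the slide; the real design gets the same effect far more cheaply with dedicated atomic primitives.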

11. Design Elements - (2/5) - Memory Model
● Single global address space - all cores in the Rigel processor share a single address space.
● Hierarchical memory model:
  ○ Every cluster has a local cluster cache for local operations.
  ○ All cores (and clusters and tiles) share a global cache for global operations.

12. Design Elements - (3/5) - Work Distribution
Key concept - Task Queues: all tasks (local or global) that have to be handled are placed in task queues.
Key concept - Task Group: a set of tasks that execute on a single Rigel cluster.
● Hierarchical task queues - global and local task queues.
● Parallel regions are divided into parallel tasks by the programmer, and task groups are formed.
● The LPI provides mechanisms for distributing tasks across parallel resources to minimize overhead.
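The point of the hierarchy is to amortize expensive global-queue accesses: a cluster pulls a whole task group from the global queue into its local queue, then its cores dequeue locally. A small sketch of that refill step, with an assumed (illustrative) group size:

```python
from collections import deque

GROUP_SIZE = 4  # tasks moved per global-queue access (illustrative value)

def refill_local(global_q, local_q, group_size=GROUP_SIZE):
    """Pull one task group from the global queue into a cluster-local queue,
    amortizing one expensive global access over several cheap local dequeues."""
    moved = 0
    while global_q and moved < group_size:
        local_q.append(global_q.popleft())
        moved += 1
    return moved

global_q = deque(range(10))  # tasks for the whole chip
local_q = deque()            # this cluster's queue
refill_local(global_q, local_q)
```

After one refill, the cluster owns a group of four tasks it can hand out with no further global traffic until the local queue drains.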

13. Design Elements - (4/5) - Coherency
● Local coherency (within a cluster) is achieved by the cluster cache.
● Global coherency (for global operations) is achieved by the global cache.
● Coherency across clusters (for read-write sharing) is achieved through software-enforced solutions:
  ○ Store and read shared data at the global cache each time, instead of the local cache.
  ○ Force the writer to explicitly flush shared data before allowing read access (so that the global cache gets updated).
  ○ Provide instructions for broadcast-invalidate and broadcast-update operations.
  ○ Note that these solutions are expensive, as they all involve the global cache.
● Ordering between local and global operations in a single core can be enforced using explicit memory barrier operations.
Key concept revisited - Logical (memory) barriers: a memory barrier forces all outstanding memory operations from a cluster to complete before allowing any memory operations after the barrier to begin.
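The "writer must explicitly flush" scheme can be illustrated with a toy two-level cache model (the `ClusterCache` class and its `flush` method are an assumption for illustration, not Rigel's actual mechanism): a write sits dirty in the writer's cluster cache and other clusters read stale data until the writer flushes to the global cache.

```python
class ClusterCache:
    """Toy model of software-managed coherence: no hardware keeps cluster
    caches coherent, so cross-cluster visibility requires an explicit flush."""

    def __init__(self, global_cache):
        self._global = global_cache
        self._dirty = {}  # locally written lines, not yet globally visible

    def write(self, addr, value):
        self._dirty[addr] = value        # stays in the cluster cache

    def read(self, addr):
        # Local hit first, otherwise fall through to the global cache.
        return self._dirty.get(addr, self._global.get(addr))

    def flush(self, addr):
        # Explicit writeback: the "force writer to flush shared data" step.
        if addr in self._dirty:
            self._global[addr] = self._dirty.pop(addr)

global_cache = {"x": 0}
writer = ClusterCache(global_cache)
reader = ClusterCache(global_cache)

writer.write("x", 42)
stale = reader.read("x")   # still 0: the write has not been flushed
writer.flush("x")
fresh = reader.read("x")   # 42: visible once the global cache is updated
```

The stale read before the flush is exactly the hazard the paper's software protocols (and memory barriers) exist to rule out.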

14. Coherency - Visual Structure of Model

15. Design Elements - (5/5) - Locality Management
Co-locating tasks onto processing resources increases local data sharing and reduces the latency and frequency of communication and synchronization amongst co-located tasks.
● Implicitly handled by hardware-managed caches, which exploit temporal and spatial locality.
● Explicitly handled by programmers via cache-management instructions.

16. Benchmarks
● Conjugate Gradient Linear Solver (cg)
● Gilbert-Johnson-Keerthi Collision Detection (gjk)
● Heat Transfer Simulation (heat)
● K-Means Clustering (kmeans)
● Dense Matrix Multiplication (dmm)
● Medical Image Reconstruction Kernel (mri)
Only the justifications for using cg, heat, and mri were described in the paper.

17. Benchmarks
[Figure: Number of clusters vs. speedup over a 1-cluster system]

18. Benchmarks - Conjugate Gradient Linear Solver
Description: the algorithm uses a sparse matrix-vector multiply (SMVM) constituting 85% of the sequential execution time. Each element in the large, read-only data array is accessed only once per iteration while performing the SMVM.
Motivation for choosing this benchmark:
● Vectors generated each iteration are shared by cores within a cluster.
● Vector modifications each iteration are exchanged through the global cache.
● A prefetch op in the Rigel ISA allows data to bypass the global cache, avoiding polluting it with touch-once data not shared across clusters.
● Rigel achieves good enqueue efficiency to satisfy high task input rates.
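For reference, the SMVM kernel that dominates cg is small; a sketch in compressed sparse row (CSR) form shows the touch-once access pattern: every stored matrix element is read exactly once per multiply, which is why caching it globally would only pollute the cache. (The CSR layout here is the standard one, assumed rather than taken from the paper.)

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix A stored in CSR form.
    Each stored element of A is touched exactly once."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]  # touch-once matrix data
        y.append(acc)
    return y

# The 2x2 sparse matrix [[2, 0], [1, 3]] in CSR form:
values = [2.0, 1.0, 3.0]
col_idx = [0, 0, 1]
row_ptr = [0, 1, 3]
y = spmv_csr(values, col_idx, row_ptr, [1.0, 1.0])  # [2.0, 4.0]
```

Only the vectors (`x`, `y`) are reused across iterations, matching the slide's point that vectors are what cores within a cluster share.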

19. Benchmarks - K-Means Clustering
Description: implements the k-means clustering machine-learning algorithm, in which n-dimensional vectors are partitioned into K bins so as to minimize the aggregate distance to each bin's centroid.
Motivation for choosing this benchmark:
● Performs efficient atomic operations at the global caches instead of a global reduction at the end of the parallel sections.
● Due to the benchmark's high arithmetic intensity and high reuse in cluster caches, the increased global-cache traffic does not adversely impact performance.
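One k-means iteration is an assignment step followed by a centroid update; the update is the reduction that Rigel replaces with atomic accumulation at the global cache. A minimal sketch on 1-D points (the benchmark uses n-D vectors; 1-D keeps the sketch short):

```python
def kmeans_assign(points, centroids):
    """Assignment step: each point goes to its nearest centroid's bin."""
    bins = {i: [] for i in range(len(centroids))}
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: abs(p - centroids[i]))
        bins[nearest].append(p)
    return bins

def kmeans_update(bins):
    """Update step: each centroid moves to the mean of its bin. This is the
    reduction that, on Rigel, can instead be accumulated with atomic ops at
    the global cache as points are assigned."""
    return [sum(b) / len(b) for b in bins.values() if b]

bins = kmeans_assign([1.0, 2.0, 9.0, 10.0], [0.0, 8.0])
centroids = kmeans_update(bins)  # [1.5, 9.5]
```

The assignment step is embarrassingly parallel and cache-friendly (high reuse of the small centroid array), which is the arithmetic-intensity point the slide makes.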

20. Benchmarks - Dense Matrix Multiply
Description: has a very regular data access pattern with high arithmetic intensity.
Motivation for choosing this benchmark:
● Exploits Rigel's ability to make effective use of locality management, prefetching, cluster-cache management, global-cache staging, and added synchronization.
● Exploits Rigel's ability to support applications amenable to static partitioning.
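Static partitioning works here precisely because the access pattern is regular: output rows can be divided among workers up front, with no task queue needed. A minimal sketch, where each "cluster" is assumed to own a contiguous row range:

```python
def matmul_rows(A, B, row_start, row_end):
    """Compute rows [row_start, row_end) of A @ B. Because every row's work
    is identical and independent, rows can be statically assigned to clusters
    before execution begins."""
    inner = len(B)
    cols = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(row_start, row_end)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
# Two "clusters", each statically owning one output row:
C = matmul_rows(A, B, 0, 1) + matmul_rows(A, B, 1, 2)  # [[19, 22], [43, 50]]
```

No dynamic load balancing is required because each partition does the same amount of work, which is what "amenable to static partitioning" means on this slide.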

21. Evaluation
[Figure: Rigel vs. GPU comparison]

22. Conclusion
● Rigel can achieve a compute density of over 8 single-precision GFLOPS/mm² in 45nm with a more flexible programming interface than conventional accelerators.
● It is important to support fast task enqueue and dequeue operations and barriers, and both can be implemented with a minimalist approach to specialized hardware.

23. Discussion Points
● We need to use the global cache every time for inter-cluster communication, but we know these operations are very expensive. Is it justifiable to call the accelerator described in the paper a programmable accelerator if it cannot efficiently handle workloads that involve such inter-cluster operations?
● The authors do not say anything about optimal barrier placement. From a programmer's perspective, would having to work out barrier placement in such a heavily parallel environment increase complexity?
● Is it worth the effort to build programmable accelerators useful for a variety of parallelizable applications, rather than specific ASICs for every domain?

27. End
