Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator
By: Aahlad Chandrabhatta Varsheeth Talluri
Outline:
Motivation
Desirables
Chip Organization
Five Design Elements & Implementation
*All images in the presentation are taken from the original paper
Accelerators:
○ Maximize throughput (throughput/area & throughput/watt)
○ Domain-specific (limited programmability)
○ Special-purpose memory hierarchies & functional units
General-purpose processors:
○ Attempt to minimize latency
○ Generic (extensive programmability)
Issue? While restricting the programming model yields high performance for data-parallel applications with regular computation and memory access patterns, it presents a difficult target for less regular applications. In other words, we need programmable accelerators.
Objective: Support high throughput while not compromising on the programming model.
Chip organization: 8 cores per cluster, 16 clusters per tile, and 8 tiles, putting 1024 cores onto a 320 mm² chip.
1. Execution Model
2. Memory Model
3. Work Distribution
4. Coherency
5. Locality Management
○ Execute parallel jobs (tasks)
○ Communication between jobs
○ Logical (Memory) Barriers: a memory barrier forces all outstanding memory operations to complete before any memory operation after the barrier begins.
Figure: the general BSP model vs. Rigel's BSP model.
Four main operations on task queues are supported.
The interface also provides atomic primitives so these operations are free of race conditions when enqueuing or dequeuing.
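A minimal sketch of such a task queue with race-free enqueue/dequeue, using a mutex in place of Rigel's hardware atomic primitives. The names (`task_queue_t`, `tq_enqueue`, etc.) are illustrative, not the paper's actual interface:

```c
#include <pthread.h>
#include <string.h>

/* A task here is just an iteration range [begin, end) -- illustrative. */
typedef struct { int begin, end; } task_t;

#define TQ_CAP 256
typedef struct {
    task_t items[TQ_CAP];        /* circular buffer of pending tasks */
    int head, tail, count;
    pthread_mutex_t lock;        /* stands in for hardware atomics */
} task_queue_t;

void tq_init(task_queue_t *q) {
    memset(q, 0, sizeof *q);
    pthread_mutex_init(&q->lock, NULL);
}

/* Atomically append a task; returns 0 if the queue is full. */
int tq_enqueue(task_queue_t *q, task_t t) {
    pthread_mutex_lock(&q->lock);
    int ok = q->count < TQ_CAP;
    if (ok) {
        q->items[q->tail] = t;
        q->tail = (q->tail + 1) % TQ_CAP;
        q->count++;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Atomically remove the oldest task; returns 0 if the queue is empty. */
int tq_dequeue(task_queue_t *q, task_t *out) {
    pthread_mutex_lock(&q->lock);
    int ok = q->count > 0;
    if (ok) {
        *out = q->items[q->head];
        q->head = (q->head + 1) % TQ_CAP;
        q->count--;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}
```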
○ Every cluster has a local cluster cache for local operations.
○ All the cores (across clusters and tiles) share a global cache for global operations.
Key Concept - Task Queues: all tasks (local or global) that have to be handled are placed in task queues.
Key Concept - Task Group: a set of tasks that execute on a single Rigel cluster.
Solutions:
○ Store and read shared data from the global cache each time, instead of the cluster cache.
○ Force the writer to explicitly flush shared data before allowing read access (so that the global cache gets updated).
○ Provide instructions for broadcast-invalidation and broadcast-update operations.
Note that these solutions are expensive, as they all involve the global cache.
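The flush/invalidate discipline can be illustrated with a toy software model of one cluster-cache line and one global-cache line. All names and structures here are invented for illustration; they are not Rigel's hardware interface:

```c
#define LINE_WORDS 4

/* A private per-cluster copy of a cache line, with a valid bit. */
typedef struct { int data[LINE_WORDS]; int valid; } cluster_line_t;
/* The shared global-cache copy of the same line. */
typedef struct { int data[LINE_WORDS]; } global_line_t;

/* Writer side: after updating its local copy, the writer must
 * explicitly flush it so the global cache sees the new data. */
void flush_line(const cluster_line_t *local, global_line_t *global) {
    for (int i = 0; i < LINE_WORDS; i++)
        global->data[i] = local->data[i];
}

/* Reader side: a broadcast invalidation marks the stale local copy
 * invalid, so the next access refills from the global cache. */
void invalidate_line(cluster_line_t *local) { local->valid = 0; }

/* Read one word, refilling from the global cache on a miss. */
int read_word(cluster_line_t *local, const global_line_t *global, int idx) {
    if (!local->valid) {
        for (int i = 0; i < LINE_WORDS; i++)
            local->data[i] = global->data[i];
        local->valid = 1;
    }
    return local->data[idx];
}
```

Without the invalidate, the reader keeps returning its stale local copy even after the writer has flushed; that is exactly the incoherence the mechanisms above exist to prevent.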
Key Concept revisited - Logical (Memory) Barriers: a memory barrier forces all outstanding memory operations to complete before any memory operation after the barrier begins.
Co-location of tasks onto processing resources to increase local data sharing, reduce latency, and reduce the frequency of costly global operations.
The paper describes justifications only for the choice of the cg, heat, and mri benchmarks.
Figure: number of clusters vs. speedup over a 1-cluster system.
Description: the algorithm uses a sparse matrix-vector multiply (SMVM) constituting 85% of the sequential execution time. Each element in the large, read-only data array is accessed only once per iteration while performing the SMVM.
Motivation for choice of this benchmark: touch-once data not shared across clusters.
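The SMVM kernel at the heart of cg can be sketched in compressed sparse row (CSR) form; each row makes a natural task, and the matrix values are touched once per iteration. This is a generic CSR kernel, not the paper's code:

```c
/* y = A * x for a sparse matrix A in CSR form:
 *   row_ptr[r]..row_ptr[r+1] indexes the nonzeros of row r,
 *   col_idx[j] is the column of the j-th nonzero, vals[j] its value. */
void smvm_csr(int nrows, const int *row_ptr, const int *col_idx,
              const double *vals, const double *x, double *y) {
    for (int r = 0; r < nrows; r++) {   /* rows partition naturally into tasks */
        double sum = 0.0;
        for (int j = row_ptr[r]; j < row_ptr[r + 1]; j++)
            sum += vals[j] * x[col_idx[j]];   /* each nonzero touched once */
        y[r] = sum;
    }
}
```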
Description: implements the K-means clustering algorithm, in which n-dimensional vectors are partitioned into K bins such that the aggregate distance is minimized.
Motivation for choice of this benchmark:
○ Parallel sections.
○ Increased global cache traffic does not adversely impact performance.
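The parallel core of K-means is the assignment step, in which each vector independently finds its nearest centroid. A generic sketch of that step, not the paper's implementation:

```c
#include <math.h>

/* Returns the index of the centroid nearest to pt (squared Euclidean
 * distance). centroids is laid out as k rows of dim doubles. Each point's
 * assignment is independent, so points can be distributed as tasks. */
int nearest_centroid(const double *pt, const double *centroids,
                     int k, int dim) {
    int best = 0;
    double best_d = INFINITY;
    for (int c = 0; c < k; c++) {
        double d = 0.0;
        for (int j = 0; j < dim; j++) {
            double diff = pt[j] - centroids[c * dim + j];
            d += diff * diff;
        }
        if (d < best_d) { best_d = d; best = c; }
    }
    return best;
}
```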
Description: it has a very regular data access pattern with high arithmetic intensity.
Motivation for choice of this benchmark: cache management, global cache staging, and added synchronization.
Rigel vs. GPU comparison:
○ Rigel provides a more flexible programming interface compared to conventional accelerators.
○ Both can be implemented with a minimalist approach to specialized hardware.
Q: Can Rigel truly be called a programmable accelerator if it can’t efficiently handle workloads which involve heavy inter-cluster communication?
Q: From the programmer’s perspective, would it increase complexity to have to come up with the barrier placement in such a heavily parallel processing environment?
Q: Would one prefer a programmable accelerator like Rigel for such applications over specific ASICs for every domain?