

SLIDE 1

Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator

By: Aahlad Chandrabhatta Varsheeth Talluri

SLIDE 2

Outline

  • Motivation
  • Desirables
  • Chip Organization
  • Five Design Elements & Implementation
  • Benchmarks
  • Evaluation
  • Conclusion
  • Discussion Points


*All images in the presentation are taken from the original paper

SLIDE 3

Motivation

  • Accelerators
    ○ Maximize throughput (throughput/area & throughput/watt)
    ○ Domain specific (limited programmability)
    ○ Special-purpose memory hierarchies & functional units
  • General-purpose processors
    ○ Attempt to minimize latency
    ○ Generic (extensive programmability)

SLIDE 4

Motivation

Issue? While restricting the programming model yields high performance for data-parallel applications with regular computation and memory-access patterns, it presents a difficult target for less regular applications. In other words, we need programmable accelerators.

SLIDE 5

Desirables

  • A programmable accelerator that provides performance through large-scale parallel execution
  • Reduce the semantic gap between the low-level programming interface (LPI) and traditional programming languages
  • The LPI needs to include primitive operations for expressing and managing parallelism
  • The LPI should also provide an effective way to exploit the accelerator’s compute throughput

SLIDE 6

Chip Organization

Objective: Support high throughput while not compromising on programming model.

  • Core: single-precision FP unit
  • Cluster: group of 8 cores
  • Cluster cache: common cache for all cores in a cluster
  • Tile: group of 16 clusters (128 cores)
  • Global cache banks: connected to all tiles
  • With 45 nm technology, 1024 cores fit onto a 320 mm² chip (see the sketch below)
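As a rough sketch in C of how these counts compose (the 8-tile figure is derived from 1024 total cores divided by 128 cores per tile; it is not stated on this slide):

    /* Rigel chip hierarchy as described on this slide. */
    #define CORES_PER_CLUSTER  8
    #define CLUSTERS_PER_TILE  16
    #define CORES_PER_TILE     (CORES_PER_CLUSTER * CLUSTERS_PER_TILE)  /* 128 */
    #define TILES_PER_CHIP     8   /* derived: 1024 cores / 128 cores per tile */
    #define CORES_PER_CHIP     (CORES_PER_TILE * TILES_PER_CHIP)        /* 1024 */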

SLIDE 7

Design Elements

1. Execution Model
2. Memory Model
3. Work Distribution
4. Coherency
5. Locality Management

SLIDE 8

Design Elements - (1/5) - Execution Model

  • SPMD, because SIMD imposes undue optimization costs on many irregular applications.
  • RISC, because the goal is an efficient accelerator with a small ISA.
  • BSP (Bulk Synchronous Parallel) execution model:
    ○ Execute parallel jobs (tasks)
    ○ Communication between jobs
    ○ Logical (memory) barriers: a memory barrier forces all outstanding memory operations from a cluster to complete before allowing any memory operation after the barrier to begin.
  • The execution model is implemented using task queues (a barrier sketch follows).
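A minimal sketch of the barrier semantics above, in C; rigel_mem_barrier() is a hypothetical intrinsic standing in for the logical memory barrier, not an actual Rigel mnemonic:

    #define N 64
    extern void rigel_mem_barrier(void);  /* hypothetical: all outstanding memory
                                             ops from this cluster complete first */
    static int compute(int i) { return i * i; }  /* stand-in for task work */

    volatile int data[N];
    volatile int ready = 0;

    void producer(void) {
        for (int i = 0; i < N; i++)
            data[i] = compute(i);
        rigel_mem_barrier();  /* writes to data[] complete before... */
        ready = 1;            /* ...the flag becomes visible to consumers */
    }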

SLIDE 9

Execution Model - Visual Structure of Model

[Figure: General BSP Model vs. Rigel’s BSP Model]

SLIDE 10

Execution Model - Queue Management Instructions

Four main operations on task queues are supported:

  • TQ_CREATE - Creates a new Task Queue.
  • TQ_ENQUEUE_GROUP - Enqueues a group of tasks to the Task Queue.
  • TQ_DEQUEUE - Dequeues one task from the Task Queue.
  • TQ_ENQUEUE - Enqueues one task to the Task Queue.

The interface also provides atomic primitives so that these enqueue and dequeue operations are free of race conditions.
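A hypothetical C binding for these four operations, plus the SPMD worker loop the execution model implies; the signatures are illustrative, not the paper’s actual interface:

    typedef struct { void (*fn)(void *); void *arg; } Task;
    typedef int TQ;  /* opaque task-queue handle */

    TQ   TQ_CREATE(void);
    void TQ_ENQUEUE(TQ q, Task t);
    void TQ_ENQUEUE_GROUP(TQ q, const Task *tasks, int n);
    int  TQ_DEQUEUE(TQ q, Task *out);  /* returns 0 when the interval ends */

    /* Every core runs the same loop (SPMD): pull a task, run it, repeat. */
    void worker(TQ q) {
        Task t;
        while (TQ_DEQUEUE(q, &t))
            t.fn(t.arg);
    }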

SLIDE 11

Design Elements - (2/5) - Memory Model

  • Single global address space: all cores in the Rigel processor share a single address space.
  • Hierarchical memory model:
    ○ Every cluster has a local cluster cache for local operations.
    ○ All cores (and clusters and tiles) share a global cache for global operations.
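A small sketch of the local/global split; global_load() and global_store() are hypothetical intrinsics standing in for global memory operations:

    extern int  global_load(int *addr);            /* hypothetical intrinsic */
    extern void global_store(int *addr, int val);  /* hypothetical intrinsic */

    int partial[64];  /* shared only within a cluster */
    int flag;         /* shared across the whole chip */

    void example(int i) {
        int x = partial[i];          /* ordinary load: completes at the cluster cache */
        if (global_load(&flag) == 0)
            global_store(&flag, x);  /* completes at the global cache,
                                        visible to all clusters */
    }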

SLIDE 12

Design Elements - (3/5) - Work Distribution

Key concept: Task Queue. All tasks (local or global) that have to be handled are placed in task queues.
Key concept: Task Group. A set of tasks that execute on a single Rigel cluster.

  • Hierarchical task queues: global and local task queues.
  • Parallel regions are divided into parallel tasks by the programmer, and task groups are formed.
  • The LPI provides mechanisms for distributing tasks across parallel resources with minimal overhead (see the sketch below).
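A hypothetical sketch of hierarchical distribution, reusing the TQ_* binding sketched earlier: the programmer splits an iteration space into tasks and enqueues them as groups, so each group lands on one cluster. GROUP_SIZE and do_iter() are illustrative names:

    #define GROUP_SIZE 16
    extern void do_iter(void *arg);  /* stand-in for one task's work */

    void distribute(TQ global_q, int n_iters) {
        Task group[GROUP_SIZE];
        for (int base = 0; base < n_iters; base += GROUP_SIZE) {
            int n = 0;
            for (int i = base; i < base + GROUP_SIZE && i < n_iters; i++)
                group[n++] = (Task){ do_iter, (void *)(long)i };
            TQ_ENQUEUE_GROUP(global_q, group, n);  /* one task group per cluster */
        }
    }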

SLIDE 13

Design Elements - (4/5) - Coherency

  • Local coherence (within a cluster) is achieved by the cluster cache.
  • Global coherence (for global operations) is achieved by the global cache.
  • Coherence across clusters (for read-write sharing) is achieved through software-enforced solutions:
    ○ Store and read shared data at the global cache every time, instead of the cluster cache.
    ○ Force the writer to explicitly flush shared data before allowing read access (so the global cache gets updated); instructions for broadcast-invalidate and broadcast-update operations are provided.
    ○ Note that both solutions are expensive, since they involve the global cache.
  • Ordering between local and global operations on a single core can be enforced using explicit memory barrier operations. Key concept revisited, logical (memory) barriers: a memory barrier forces all outstanding memory operations from a cluster to complete before allowing any memory operation after the barrier to begin. (A flush-and-publish sketch follows.)
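A minimal flush-and-publish sketch of the software-enforced coherence above, assuming the hypothetical intrinsics from earlier slides plus cache_flush(), an illustrative writeback of a cluster-cache line to the global cache:

    extern void cache_flush(void *addr);       /* hypothetical writeback intrinsic */
    extern void rigel_mem_barrier(void);       /* hypothetical memory barrier */
    extern void global_store(int *addr, int);  /* hypothetical global store */

    void publish(int *shared, int n, int *ready_flag) {
        for (int i = 0; i < n; i++)
            cache_flush(&shared[i]);  /* push dirty data to the global cache */
        rigel_mem_barrier();          /* flushes complete before the flag write */
        global_store(ready_flag, 1);  /* readers may now fetch shared[] globally */
    }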

SLIDE 14

Coherency - Visual Structure of Model

[Figure: visual structure of the coherence model]
SLIDE 15

Design Elements - (5/5) - Locality Management

  • Co-location of tasks onto processing resources increases local data sharing and reduces the latency and frequency of communication and synchronization among co-located tasks.
  • Implicitly handled by hardware-managed caches that exploit temporal and spatial locality.
  • Explicitly handled by programmers via cache-management instructions (see the sketch below).
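A hypothetical example of explicit cache management: prefetch a block into the cluster cache before computing on it. prefetch_line() and LINE_WORDS are illustrative, not actual Rigel names:

    #define LINE_WORDS 8
    extern void prefetch_line(const void *addr);  /* hypothetical intrinsic */

    void scale_block(float *buf, int n) {
        for (int i = 0; i < n; i += LINE_WORDS)
            prefetch_line(&buf[i]);  /* warm the cluster cache */
        for (int i = 0; i < n; i++)
            buf[i] *= 2.0f;          /* compute now hits the warm cache */
    }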

SLIDE 16

Benchmarks

  • Conjugate Gradient Linear Solver (cg)
  • Gilbert-Johnson-Keerthi Collision Detection (gjk)
  • Heat Transfer Simulation (heat)
  • K Means Clustering (kmeans)
  • Dense Matrix Multiplication (dmm)
  • Medical Image Reconstruction Kernel (mri)

Only the justifications for using cg, heat, and mri were described in the paper.

SLIDE 17

Benchmarks

[Figure: speedup over a 1-cluster system vs. number of clusters]

SLIDE 18

Benchmarks - Conjugate Gradient Linear Solver

Description: the algorithm uses a sparse matrix-vector multiply (SMVM) constituting 85% of the sequential execution time. Each element in the large, read-only data array is accessed only once per iteration while performing the SMVM.

Motivation for choosing this benchmark:

  • Vectors generated each iteration are shared by cores within a cluster.
  • Vector modifications are exchanged each iteration through the global cache.
  • A prefetch operation in the Rigel ISA allows data to bypass the global cache, avoiding polluting it with touch-once data not shared across clusters (see the sketch below).
  • Rigel achieves enqueue efficiency good enough to satisfy the benchmark’s high task-input rates.
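A sketch of the SMVM kernel with a hypothetical global-cache-bypassing prefetch, prefetch_nga() ("no global allocate"); the CSR layout and intrinsic name are illustrative:

    extern void prefetch_nga(const void *addr);  /* hypothetical: bypass global cache */

    void smvm(int nrows, const int *rowptr, const int *col,
              const float *val, const float *x, float *y) {
        for (int r = 0; r < nrows; r++) {
            /* Matrix data is touch-once: prefetch it past the global cache. */
            for (int k = rowptr[r]; k < rowptr[r + 1]; k += 8)
                prefetch_nga(&val[k]);
            float acc = 0.0f;
            for (int k = rowptr[r]; k < rowptr[r + 1]; k++)
                acc += val[k] * x[col[k]];  /* x is reused within the cluster */
            y[r] = acc;
        }
    }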

SLIDE 19

Benchmarks - K Means Clustering

Description: implements the k-means clustering algorithm, in which n-dimensional vectors are partitioned into K bins so that the aggregate distance to the bin centers is minimized.

Motivation for choosing this benchmark:

  • Performs efficient atomic operations at the global cache instead of a global reduction at the end of the parallel section (see the sketch below).
  • Due to the benchmark’s high arithmetic intensity and high reuse in cluster caches, the increased global-cache traffic does not adversely impact performance.
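A sketch of accumulating cluster statistics with atomics that complete at the global cache, rather than a per-core reduction at the end of the parallel section; atomic_add_global_f() and atomic_add_global_i() are hypothetical intrinsics:

    extern void atomic_add_global_f(float *addr, float val);  /* hypothetical */
    extern void atomic_add_global_i(int *addr, int val);      /* hypothetical */

    /* Fold one vector into its assigned bin's running centroid sums. */
    void accumulate(const float *v, int dim, int bin,
                    float *centroid_sum, int *centroid_count) {
        for (int d = 0; d < dim; d++)
            atomic_add_global_f(&centroid_sum[bin * dim + d], v[d]);
        atomic_add_global_i(&centroid_count[bin], 1);
    }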

SLIDE 20

Benchmarks - Dense Matrix Multiply

Description: dense matrix multiply has a very regular data-access pattern with high arithmetic intensity.

Motivation for choosing this benchmark:

  • Exploits Rigel’s ability to make effective use of locality management, prefetching, cluster-cache management, global-cache staging, and added synchronization.
  • Exploits Rigel’s ability to support applications amenable to static partitioning (see the sketch below).
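A sketch of the static partitioning this benchmark relies on: each cluster is assigned a fixed block of output rows. cluster_id() and NUM_CLUSTERS are hypothetical names, and n is assumed divisible by NUM_CLUSTERS:

    #define NUM_CLUSTERS 128
    extern int cluster_id(void);  /* hypothetical: this cluster's index */

    /* C = A * B, all n x n, row-major; each cluster computes its own rows. */
    void dmm_partition(int n, const float *A, const float *B, float *C) {
        int rows = n / NUM_CLUSTERS;
        int r0 = cluster_id() * rows;
        for (int i = r0; i < r0 + rows; i++)
            for (int j = 0; j < n; j++) {
                float acc = 0.0f;
                for (int k = 0; k < n; k++)
                    acc += A[i * n + k] * B[k * n + j];
                C[i * n + j] = acc;
            }
    }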

SLIDE 21

Evaluation

[Figure: Rigel vs. GPU comparison]

SLIDE 22

Conclusion

  • Rigel can achieve a compute density of over 8 single-precision GFLOPS/mm² in 45 nm, with a more flexible programming interface than conventional accelerators.
  • It is important to support fast task enqueue and dequeue operations and barriers, and both can be implemented with a minimalist approach to specialized hardware.

SLIDE 26

Discussion Points

  • Inter-cluster communication must always go through the global cache, yet we know these operations are very expensive. Is it justifiable to call the accelerator described in the paper a programmable accelerator if it cannot efficiently handle workloads that involve such inter-cluster operations?
  • The authors do not say anything about optimal barrier placement. From a programmer’s perspective, does having to work out barrier placement in such a heavily parallel environment increase complexity?
  • Is it worth the effort to build programmable accelerators useful for a variety of parallelizable applications, rather than specific ASICs for every domain?

SLIDE 27

End
