

SLIDE 1

Programming Systems for Specialized Architectures

Interface, Data, Approximation

Sarita Adve

With: Vikram Adve, Johnathan Alsop, Maria Kotsifakou, Sasa Misailovic, Matt Sinclair, Prakalp Srivastava
University of Illinois at Urbana-Champaign, sadve@illinois.edu
Sponsors: NSF, C-FAR, ADA (a JUMP center by SRC, DARPA)

SLIDE 2

A Modern Mobile SoC

[Figure: a modern mobile SoC with multiple CPUs (L1/L2 caches) and vector units, a GPU, DSPs, multimedia and A/V hardware accelerators, a modem, and GPS, sharing an interconnect and main memory. The components have different parallelism models, incompatible memory systems, and different hardware ISAs.]

Increasing diversity in & across SoCs, supercomputers, data centers, …
  • Need common interface (abstractions): HW-independent SW development, "object code" portability
  • Data movement critical: memory structures, communication, consistency, synchronization
  • Approximation: application-driven solution-quality trade-off to increase efficiency

SLIDE 3

Interfaces: Back to the Future

April 7, 1964: IBM announced the 360

  • Family of machines w/ common abstraction/interface/ISA

– Programmer freedom: no reprogramming
– Designer freedom: implementation creativity

Not unique

  • CPUs: ISAs; Internet: IP; GPUs: CUDA; Databases: SQL; …
SLIDE 4

Current Interface Levels

[Figure: interface levels spanning CPUs + vector SIMD units, GPUs, DSPs, FPGAs, and domain-specific accelerators: "hardware" ISA, virtual ISA, language-neutral compiler IR, language-level compiler IR, general-purpose programming language, domain-specific programming language. Example interfaces at these levels include IBM AS/400, Transmeta, PTX, HSAIL, codesigned virtual machines, SPIR, HPVM, Delite IR, OSCAR, Polly, Delite DSL IR, DLVM, TVM, CUDA, OpenCL, OpenACC, OpenMP, Python, Julia, TensorFlow, MXNet, and Halide. The choice of level trades off hardware innovation, object-code portability, compiler investment, language innovation, application performance, and application productivity.]

Source: Vikram Adve, HPVM project, https://publish.illinois.edu/hpvm-project/

SLIDE 5

Which Interface Levels Can Be Uniform?

[Figure: same interface-level stack as Slide 4.] Hardware ISAs are too diverse to define a uniform interface; domain-specific programming languages are also too diverse; the levels in between (virtual ISA and compiler IRs) are much more uniform.

Source: Vikram Adve, HPVM project, https://publish.illinois.edu/hpvm-project/

SLIDE 6

One Example

HPVM: Heterogeneous Parallel Virtual Machine [PPoPP'18]
A parallel program representation for heterogeneous parallel hardware:

  • Virtual ISA: portable virtual object code, simpler translators
  • Compiler IR: optimizations, map diverse parallel languages
  • Runtime Representation for flexible scheduling: mapping, load balancing

Generalization of LLVM IR for parallel heterogeneous hardware
PPoPP'18: results on GPUs (NVIDIA), a vector ISA (AVX), and multicore CPUs (Intel Xeon)
Ongoing: FPGAs, novel domain-specific SoCs

SLIDE 7

HPVM Abstractions

[Figure: a hierarchical dataflow graph with side effects; leaf nodes contain ordinary scalar/vector code, e.g.:
    VA = load <L4 x float>* A
    VB = load <L4 x float>* B
    …
    VC = fmul <L4 x float> VA, VB ]

SLIDE 8

HPVM Abstractions (cont.)

[Same hierarchical dataflow-graph figure as Slide 7.]

  • Task, data, vector parallelism
  • Streams, pipelines
  • Shared memory
  • High-level optimizations
  • FPGAs (more custom hw?)

N different parallelism models → a single unified model
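HPVM itself is defined as an extension of LLVM IR (intrinsics for creating graphs, nodes, and edges), not as a C++ library. The following is only a minimal C++ sketch of the unified model these abstractions provide: a hierarchical dataflow graph whose leaf nodes run ordinary scalar/vector code and are replicated over an index space. None of the names below are HPVM APIs.

// Illustrative sketch of HPVM's hierarchical dataflow-graph model (not the HPVM API).
// A leaf node runs ordinary scalar/vector code; an internal node creates many dynamic
// instances of its child over an index space; edges and shared memory carry the data.
#include <cstddef>
#include <vector>

// Leaf node: one dynamic instance handles one element (the "vector code" in the figure).
void vecMulLeaf(const float* A, const float* B, float* C, std::size_t i) {
  C[i] = A[i] * B[i];   // loads/stores: the graph has explicit side effects
}

// Internal node: replicates the leaf node N times. A translator may map these
// instances to GPU threads, vector lanes, or CPU cores without changing the program.
void vecMulGraph(const float* A, const float* B, float* C, std::size_t N) {
  for (std::size_t i = 0; i < N; ++i)   // sequential here; parallel on real targets
    vecMulLeaf(A, B, C, i);
}

int main() {
  std::vector<float> A(1024, 2.0f), B(1024, 3.0f), C(1024);
  vecMulGraph(A.data(), B.data(), C.data(), A.size());  // "launch" the graph
  return 0;
}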

SLIDE 9

Data

Data movement critical to efficiency

  • Memory structures
  • Communication
  • Coherence
  • Consistency
  • Synchronization

Uniform communication interface for hardware
Abstract to software interface

[Figure: accelerators (Accel. 1 to 4), each behind a uniform interface module (IF), with per-accelerator storage and communication options (cache, stash, coherent access, FIFO, RDMA), connected on chip and via inter-chip interfaces (inter-chip IF).]

SLIDE 10

Application-Customized Accelerator Communication Architecture

[Same accelerator communication figure as Slide 9.]

Problem: design + integrate multiple accelerator memory systems + communication
Challenges:
‒ Friction between different app-specific specializations
‒ Inefficiencies due to deep memory hierarchy
‒ Multiple scales: on-chip to cloud
New accelerator communication architecture:
‒ Coherent, global address space
‒ App-specialized coherence, communication, storage, solution quality
One example next, focused on coherence: Spandex [ISCA'18]
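As one illustration of "app-specialized storage within a coherent, global address space", the sketch below mimics the idea behind a stash: scratchpad-like direct addressing whose backing data stay globally visible. The names and the mapping call are hypothetical placeholders, not the interface from the Stash work (ISCA'15).

// Illustrative sketch: a "stash" is addressed directly like a scratchpad, but the
// mapped data remain part of the coherent global address space, so other devices
// can still find them. All names here are hypothetical.
#include <cstddef>
#include <cstdio>

struct StashMapping {
  float*      global_base;  // backing region in the coherent global address space
  std::size_t length;       // number of elements mapped into the stash
};

// Hypothetical call: declare that global_base[0..length) is now accessed through
// the accelerator's stash; no eager copy is implied, data can be fetched lazily.
StashMapping stash_map(float* global_base, std::size_t length) {
  return {global_base, length};
}

// Accelerator kernel addresses the stash directly (cheap, scratchpad-like access),
// while its writes remain visible to other devices because the mapping is coherent.
void scale_in_stash(const StashMapping& m, float factor) {
  for (std::size_t i = 0; i < m.length; ++i)
    m.global_base[i] *= factor;
}

int main() {
  float buf[256] = {1.0f};
  StashMapping m = stash_map(buf, 256);
  scale_in_stash(m, 2.0f);
  std::printf("%f\n", buf[0]);
  return 0;
}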

SLIDE 11

Heterogeneous devices have diverse memory demands

[Figure: radar chart over five axes: spatial locality, temporal locality, throughput sensitivity, latency sensitivity, fine-grain synchronization.]

SLIDE 12

Heterogeneous devices have diverse memory demands
Typical CPU workloads: fine-grain synch, latency sensitive
[Same radar-chart axes as Slide 11.]

SLIDE 13

Heterogeneous devices have diverse memory demands
Typical GPU workloads: spatial locality, throughput sensitive
[Same radar-chart axes as Slide 11.]

SLIDE 14

MESI coherence targets CPU workloads

Protocol property       | MESI              | GPU coherence              | DeNovo
Granularity             | Line              | Reads: line, writes: word  | Reads: flexible, writes: word
Stale-data invalidation | Writer-invalidate | Self-invalidate            | Self-invalidate
Write propagation       | Ownership         | Write-through              | Write-back
Good for                | CPU               | GPU                        | CPU or GPU

GPU Coherence
  • Fine-grain writes: ✓ no false sharing; ✗ reduced spatial locality
  • Self-invalidation: ✓ simple, scalable; ✗ synch limits read reuse
  • Write-through caches: ✓ simple, low overhead; ✗ synch limits write reuse

MESI
  • Coarse-grain state: ✓ spatial locality; ✗ false sharing
  • Writer-initiated invalidation: ✓ temporal locality for reads; ✗ overheads limit throughput, scalability
  • Ownership-based updates: ✓ temporal locality for writes; ✗ indirection if low locality
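A toy C++ sketch of the two stale-data invalidation strategies in the table above; it models only the invalidation decision (no states, ownership, or granularity), so it is an illustration rather than a protocol implementation.

// Toy contrast of stale-data invalidation strategies; not a full coherence protocol.
// Writer-invalidate (MESI-like): a write eagerly removes all other cached copies.
// Self-invalidate (GPU coherence / DeNovo-like): each cache discards possibly stale
// data itself at a synchronization point (an acquire), so no sharer list is needed.
#include <cstdio>
#include <unordered_map>
#include <unordered_set>

struct Cache {
  std::unordered_map<int, int> lines;   // address -> cached value
};

// Writer-invalidate: the writer must reach every other cache holding the address.
void writeWriterInvalidate(int addr, int val, Cache& writer,
                           const std::unordered_set<Cache*>& caches) {
  for (Cache* c : caches)
    if (c != &writer) c->lines.erase(addr);   // eager invalidation messages
  writer.lines[addr] = val;
}

// Self-invalidate: the reader drops cached data at its own acquire; writes can then
// simply be written through (or owned) without tracking sharers.
void acquireSelfInvalidate(Cache& c) {
  c.lines.clear();   // conservatively discard potentially stale copies
}

int main() {
  Cache cpu, gpu;
  const std::unordered_set<Cache*> caches = {&cpu, &gpu};
  gpu.lines[42] = 1;                              // GPU holds an old copy of address 42
  writeWriterInvalidate(42, 7, cpu, caches);      // MESI-style: GPU copy removed now
  acquireSelfInvalidate(gpu);                     // GPU-coherence-style: removed at acquire
  std::printf("cpu=%zu cached, gpu=%zu cached\n", cpu.lines.size(), gpu.lines.size());
  return 0;
}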

SLIDE 15

GPU coherence fits GPU workloads

[Same protocol-comparison table and GPU-coherence trade-offs as Slide 14: GPU coherence's fine-grain writes, self-invalidation, and write-through caches match GPU workloads.]

SLIDE 16

DeNovo is good fit for CPU and GPU

Protocol property       | MESI              | GPU coherence              | DeNovo
Granularity             | Line              | Reads: line, writes: word  | Reads: flexible, writes: word
Stale-data invalidation | Writer-invalidate | Self-invalidate            | Self-invalidate
Write propagation       | Ownership         | Write-through              | Ownership
Good for                | CPU               | GPU                        | CPU or GPU

SLIDE 17

Integrating Diverse Coherence Strategies

Existing Solutions: MESI-based LLC

  • Accelerator requests forced to use MESI
  • Added latency for inter-device communication
  • MESI is complex: extensions are difficult

[Figure: existing design, with a CPU (MESI L1), a GPU (GPU-coherence L1 behind a MESI/GPU-coherence hybrid L2), and an FPGA/ASIC (?) attached to a MESI LLC. Proposed design, with a CPU (MESI L1), a GPU (GPU-coherence L1), and an FPGA/ASIC (DeNovo L1) attached directly to a Spandex LLC.]

Spandex: DeNovo-based interface [ISCA’18]

  • Supports write-through and write-back
  • Supports self-invalidate and writer-invalidate
  • Supports requests of variable granularity
  • Directly interfaces MESI, GPU coherence, and hybrid (e.g., DeNovo) caches
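A hedged sketch of what a Spandex-style LLC interface could look like: each device keeps its own L1 protocol and issues requests that name the desired behavior and granularity. The request names, fields, and actions below are simplified placeholders inspired by the bullets above, not the actual Spandex message types.

// Illustrative sketch of an LLC that accepts heterogeneous coherence requests.
// Names and fields are simplified placeholders, not the exact Spandex messages.
#include <cstdint>
#include <cstdio>

enum class ReqKind {
  ReadValid,    // self-invalidated read: just return data, no sharer tracking
  ReadShared,   // MESI-style read: track the requester so it can be invalidated later
  Ownership,    // obtain ownership before writing (write-back / DeNovo-style)
  WriteThrough  // word-granularity write-through (GPU-coherence-style)
};

struct Request {
  ReqKind  kind;
  uint64_t addr;
  uint32_t bytes;     // variable granularity: a word, a line, or more
  int      deviceId;
};

// The LLC needs only one unified set of actions; each device keeps its own L1 protocol.
void handleAtLLC(const Request& r) {
  switch (r.kind) {
    case ReqKind::ReadValid:    std::printf("return %u B, no state kept\n", r.bytes); break;
    case ReqKind::ReadShared:   std::printf("return data, add device %d as sharer\n", r.deviceId); break;
    case ReqKind::Ownership:    std::printf("register device %d as owner of 0x%llx\n",
                                            r.deviceId, (unsigned long long)r.addr); break;
    case ReqKind::WriteThrough: std::printf("merge %u B into LLC\n", r.bytes); break;
  }
}

int main() {
  handleAtLLC({ReqKind::ReadValid, 0x1000, 4, /*GPU*/ 1});
  handleAtLLC({ReqKind::ReadShared, 0x2000, 64, /*CPU*/ 0});
  return 0;
}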

SLIDE 18

Example: Collaborative Graph Applications

Vertex-centric algorithms: distribute vertices among CPU, GPU threads


Application                       | Access pattern                                    | Important dimension                                             | Results
Pull-based PageRank               | Read neighbor vertices, update local vertex       | Flat LLC avoids indirection for read misses                     | Spandex LLC ⇒ 37% better exec. time, 9% better NW traffic
Push-based Betweenness Centrality | Read local vertex, update (RMW) neighbor vertices | Ownership-based write propagation exploits locality in updates | DeNovo at GPU ⇒ 18% better exec. time, 61% better NW traffic
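To make the two access patterns in the table concrete, here is a small sequential C++ sketch over a CSR graph (the actual study distributes vertices across CPU and GPU threads; the PageRank update is simplified, with damping details and degree normalization omitted).

// Illustrative pull vs. push update patterns on a CSR graph (sequential sketch).
#include <cstddef>
#include <vector>

struct CSRGraph {
  std::vector<std::size_t> rowPtr;  // per-vertex offsets into 'neigh' (size = V + 1)
  std::vector<std::size_t> neigh;   // concatenated neighbor lists
};

// Pull-based (PageRank-style): read all neighbor values, write only the local vertex.
// Reads dominate, so an LLC that avoids indirection on read misses helps.
void pullStep(const CSRGraph& g, const std::vector<double>& oldRank,
              std::vector<double>& newRank) {
  for (std::size_t v = 0; v + 1 < g.rowPtr.size(); ++v) {
    double sum = 0.0;
    for (std::size_t e = g.rowPtr[v]; e < g.rowPtr[v + 1]; ++e)
      sum += oldRank[g.neigh[e]];        // read neighbor vertices
    newRank[v] = 0.15 + 0.85 * sum;      // single local write (normalization omitted)
  }
}

// Push-based (betweenness-centrality-style): read the local vertex, then
// read-modify-write every neighbor. Ownership-based write propagation lets a device
// keep re-updating the same neighbors locally instead of writing each update through.
void pushStep(const CSRGraph& g, const std::vector<double>& local,
              std::vector<double>& accum) {
  for (std::size_t v = 0; v + 1 < g.rowPtr.size(); ++v)
    for (std::size_t e = g.rowPtr[v]; e < g.rowPtr[v + 1]; ++e)
      accum[g.neigh[e]] += local[v];     // RMW on neighbors (atomics when parallel)
}

int main() {
  CSRGraph g{{0, 1, 2}, {1, 0}};         // two vertices pointing at each other
  std::vector<double> a(2, 1.0), b(2, 0.0);
  pullStep(g, a, b);
  pushStep(g, a, b);
  return 0;
}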

SLIDE 19

Looking Forward…

HPVM + DRF Consistency + ???

Synchronization locality, producer/consumer relationships, data locality and visibility, coarse-grain operations

Software innovations + hardware innovations: hLRC, adaptive laziness, HBM caches, Spandex, dynamic caches, hardware queues, coherent scratchpads (Stash, ISCA'15), NVRAM, …

SLIDE 20

Approximation

How to express solution quality from the application to the hardware?
Integrate approximation (quality) into the interface.
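The slide poses the question rather than an interface, so the following is a purely hypothetical sketch of the kind of information such an interface might carry: the application declares how much solution quality it can give up, and the system uses that to pick a cheaper execution (here, crude sampling). Every name and the policy are illustrative assumptions.

// Purely hypothetical sketch: an application states how much solution quality it is
// willing to give up, and a runtime picks a cheaper variant. Not a real API.
#include <cstddef>
#include <cstdio>
#include <vector>

struct QualitySpec {
  double maxRelError;   // e.g., 0.01 means "answers within 1% are acceptable"
};

// Exact mean vs. a cheaper sampled mean; a runtime could choose based on the spec.
double mean(const std::vector<double>& x, const QualitySpec& q) {
  // crude policy: if some error is tolerated, sample every 4th element
  std::size_t stride = (q.maxRelError > 0.0) ? 4 : 1;
  double sum = 0.0;
  std::size_t n = 0;
  for (std::size_t i = 0; i < x.size(); i += stride) { sum += x[i]; ++n; }
  return n ? sum / n : 0.0;
}

int main() {
  std::vector<double> data(1000, 1.0);
  std::printf("exact=%f approx=%f\n", mean(data, {0.0}), mean(data, {0.05}));
  return 0;
}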

SLIDE 21

Summary

  • Interfaces
  • Data
  • Approximation