The HammerBlade: An ML-Optimized Supercomputer for ML and Graphs

SLIDE 1

The HammerBlade: An ML-Optimized Supercomputer for ML and Graphs

Dec 2018

  • Prof. Michael B. Taylor (PI), University of Washington
  • Prof. Luis Ceze, University of Washington
  • Prof. Adrian Sampson, Cornell University
  • Prof. Chris Batten, Cornell University
  • Prof. Zhiru Zhang, Cornell University
  • Prof. Mark Oskin, University of Washington
  • Dr. Dustin Richmond (Postdoc), University of Washington

SLIDE 2

Fast Intro to Today’s HW Landscape

  • The End of Moore’s Law Approaches
  • Dennard Scaling Ended a Decade Ago

Energy is a fundamental limiter of all compute

Specialization Is the Solution

SLIDE 3

HammerBlade: Key Insights

Key Intellectual Thrusts of The HammerBlade

How do we solve HW & SW Specialization Complexity?

Move from a Human-Centric Computation Abstraction Hierarchy to an ML-Centric Computation Abstraction Hierarchy

SLIDE 4

HammerBlade: Key Insights

Human-Centric Abstraction Layers, spanning Computation to Physics:

  • Language / API
  • Compiler / OS
  • ISA
  • Micro Arch
  • Meta HDL
  • HDL
  • APR
  • DFM
  • Design Rules

SLIDE 5

HammerBlade: Key Insights

Figure: two stacks, each spanning Computation to Physics, labeled Human-Centric Abstraction Layers and ML-Centric Abstraction Layers. Both show the layers Language / API, Compiler / OS, ISA, Micro Arch, Meta HDL, HDL, APR, DFM, Design Rules.

SLIDE 6

HammerBlade: Key Insights

Redesign the compute stack knowing that Machine Learning Will Drive How Computation is Realized in HW & SW

SLIDE 7

HammerBlade: Key Insights

  • ML Co-designing HW/SW ... for ML
  • ML Co-designing HW/SW ... for Graphs
  • ML Co-designing HW/SW ... for Graphs & ML

SLIDE 8

HammerBlade: Key Insights

Key Intellectual Thrusts of The HammerBlade

How do we solve HW Specialization’s Inflexibility?

Seamless blend of specialization at multiple levels, with a focus on tight interoperability...

  • CGRA
  • FPGA
  • ASIC Hard Blocks
  • RISC-V CPUs
  • Memory System
  • Interconnect

Dark Silicon: Not everything is on; Target metric is energy-efficiency not utilization

SLIDE 9

HammerBlade: Key Insights

Key Intellectual Ideas of The HammerBlade

How do we address the long binding times of specialization in response to changing datasets?

Move from year/month/day specialization times to minutes/secs/microsec?

ML-stitching of ML-predesigned Fabric & Domain Specific Templates

Diagram labels:

  • EasyML Human-Centric DSL, EasyGraph Human-Centric DSL
  • PyTorch, TensorFlow, MXNet
  • TVM (tensor VM): ML-based compilation, high-level optimizations, schedule decoupling (Halide)
  • FPGA Domain Template Language
  • Manycore Domain Template Library
  • ASIC Block Domain Template Library
  • RunTime Domain Template Library
  • CGRA Domain Template Library
  • NOC/Mem Domain Template Library
  • Vertex Centric, Edge Centric
  • GraphIt, GVM (graph VM)
  • Adaptive Layout Engine
  • Zuppa

SLIDE 10

The HammerBlade Hardware Architecture

SLIDE 11

Program Overview

HammerBlade Chimera Tile

  • ML-Designed CGRA Fabric (Incl. ASIC Hard Blocks)
  • RISC-V RV32 Cores
  • ML-Tuned FPUs
  • ML-Configured Reconfigurable Local Memory
  • ML-Programmed Interconnections

SLIDE 12

Specialized Intertile Network Fabrics

SLIDE 13

Program Overview

HammerBlade ASIC

  • 8K Chimera Tiles
  • 128 RISC-V 64-bit Linux-Capable Cores
  • Reconfigurable LLC
  • Reconfigurable I/O
  • 14 nm & 7 nm, large die

Linux-Capable RISC-V RV64G Core: UW/BU BlackParrot; funded by DARPA POSH

SLIDE 14

Leveraging Celerity’s Manycore into HammerBlade Manycore/CGRA Hybrid

Celerity (opencelerity.org, IEEE Micro ’18 Paper):

  • Broke RISC-V performance record by 100X (500B RISC-V ops per sec)
  • Silicon proven in 16nm. Open Source.
  • 50 processors per mm2
  • DARPA CRAFT

HammerBlade:

  • Exponentially better programmability & perf. robustness
    ○ I-caches in Chimera Tiles (CTs), initial version
    ○ Memory hierarchy, initial version
    ○ Latency Hiding in CTs (non-blocking loads & stores)
    ○ Unified Physical Address Space, initial version
  • Preserve amazing compute density and efficiency
    ○ Logic for fully pipelined processor and high performance mesh router takes less space than 4K of SRAM (!)
    ○ Integrate CGRA functionality without tile size explosion

HammerBlade Manycore

Implemented?

SLIDE 15

36-tile HammerBlade Proto in TSMC 40nm; en route to 16nm

SLIDE 16

HammerBlade PCB

SLIDE 17

Program Overview

HammerBlade Chassis

SLIDE 18

HammerBlade System

SLIDE 19

Memory Transmutation Layer

Dynamically Optimizing Data Movement Across the Full Machine Hierarchy

SLIDE 20

Program Overview

The HammerBlade

A Supercomputer Appliance for ML & Graphs (TA1) with a Dynamically Evolving Software Stack (TA2)

Diagram labels: Bare Metal Hardware Personality, Compiler & Runtime, Application, HW/SW Interface

Continuous Synthesis: learning-based empirical co-design, from design to execution.

SLIDE 21

Q1 Reporting/Updates: TA-2

HBIR

SLIDE 22

Technical Approach: Software Abstraction Layers

Diagram labels:

  • HammerML Human-Centric DSL, HammerGraph Human-Centric DSL
  • PyTorch, TensorFlow, MXNet
  • TVM (tensor VM): ML-based compilation, high-level optimizations, schedule decoupling (Halide)
  • FPGA Domain Template Language
  • Manycore Domain Template Library
  • ASIC Block Domain Template Library
  • RunTime Domain Template Library
  • CGRA Domain Template Library
  • NOC/Mem Domain Template Library
  • GraphIt, GVM (graph VM)
  • CUDA Lite, HBIR
  • HammerBlade Bare Metal

SLIDE 23

TVM: Extensible, End-to-end Compilation for Deep Learning

160+ contributors, several industrial users. Try it out!

SLIDE 25

TVM: Extensible, End-to-end Compilation for Deep Learning

Significant engineering cost to optimize this mapping. Billions of possibilities.

SLIDE 26

AutoTVM: Automating Code Optimizations using ML

  • Works very well for low-level ML code optimization [NIPS’18 spotlight]
    ○ e.g., beats NVIDIA's hand-tuned TitanX CUDA code
  • Now applying to HW design exploration
    ○ Produce HW design variants, evaluate with compiler-in-the-loop
    ○ Learn HW design parameters -> performance (timing, power)
    ○ Next: from code -> HW variant -> performance
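The hardware design exploration loop named on this slide (produce variants, evaluate them, learn a mapping from design parameters to performance) can be sketched as follows. This is a minimal illustration and not AutoTVM's implementation: the two design parameters, the synthetic `measure` function standing in for a compiler-in-the-loop evaluation, and the nearest-neighbor cost model are all assumptions chosen for brevity.

```python
import itertools
import math
import random

# Hypothetical design space: parameter names and ranges are illustrative only.
DESIGN_SPACE = list(itertools.product(
    [1, 2, 4, 8, 16],        # lanes: number of parallel compute units
    [16, 32, 64, 128, 256],  # buffer_kb: on-chip buffer size
))

def measure(design):
    """Stand-in for a compiler-in-the-loop run; returns a synthetic latency (lower is better)."""
    lanes, buffer_kb = design
    compute = 1024 / lanes                 # more lanes, less compute time
    memory = 4096 / buffer_kb              # bigger buffer, fewer refills
    overhead = 2 * lanes + buffer_kb / 8   # larger designs pay an area/clock penalty
    return compute + memory + overhead

def predict(design, measured, k=3):
    """Nearest-neighbor cost model: mean latency of the k closest measured designs."""
    def dist(other):
        return sum((math.log2(a) - math.log2(b)) ** 2 for a, b in zip(design, other))
    nearest = sorted(measured, key=dist)[:k]
    return sum(measured[d] for d in nearest) / len(nearest)

def explore(space, n_seed=8, n_top=5, seed=0):
    """Measure a random seed set, rank the rest with the model, measure only the top few."""
    rng = random.Random(seed)
    measured = {d: measure(d) for d in rng.sample(space, n_seed)}
    candidates = sorted((d for d in space if d not in measured),
                        key=lambda d: predict(d, measured))
    for d in candidates[:n_top]:
        measured[d] = measure(d)
    best = min(measured, key=measured.get)
    return best, measured

best, measured = explore(DESIGN_SPACE)
print("best design:", best, "measured", len(measured), "of", len(DESIGN_SPACE))
```

The point of the learned model is that only a fraction of the space is ever measured; a real flow would replace `measure` with synthesis and compilation runs, which are far more expensive than the model queries.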

SLIDE 27

AutoTVM: Automating Code Optimizations using ML

Figure: AutoTVM Conv2d example on TitanX

SLIDE 28

Decoupled Access-Execute Deep Learning Accelerator Templates

Bespoke

Commodity

Current open implementation @ tvm.ai/vta

SLIDE 29

AutoVTA: Automatic Exploration of HW-SW Co-design w/ Compiler-in-the-loop (P1-TA2.4)

VTA variants: 1000s -> 10s

SLIDE 30

AutoVTA: Automatic Exploration of HW-SW Co-design w/ Compiler-in-the-loop (P1-TA2.4)

Apply AutoTVM

Selected designs with best End-to-End performance.

SLIDE 31

“CUDA Lite”: A Near Term IR for HB Manycore

Short-term benefit of having an existing IR for architects to program the manycore:

  • CUDA can express independent computation and locality, and it is widely used.
  • Inability to support CUDA constructs efficiently can identify issues in the HB design.
  • TVM already lowers to CUDA.
  • Pre-existing CUDA code is easy to port over for architectural testing.
  • High level of industry interest in a RISC-V manycore programmable with CUDA.

CUDA:

    __global__ void add (int* a, int* b, int* c) {
        int tid = threadIdx.x;
        if (tid < N)  // out-of-bounds check
            c[tid] = a[tid] + b[tid];
    }

Manycore translation:

    hb_tile void add (int* a, int* b, int* c) {
        // thread loop
        #pragma unroll
        for (int x = hb_gangIndex; x < blockDim.x; x += hb_gangSize) {
            c[x] = a[x] + b[x];
        }
    }
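The translation on this slide turns one logical CUDA thread per element into a strided loop per tile. The host-side emulation below is a sketch of that idea, assuming `hb_gangIndex` is a tile's position within a gang of `hb_gangSize` cooperating tiles (the slide does not define either name). The Python function names are hypothetical.

```python
def cuda_add(a, b, block_dim, n):
    """Reference semantics of the CUDA kernel: one logical thread per element."""
    c = [0] * n
    for tid in range(block_dim):   # each iteration stands for one CUDA thread
        if tid < n:                # the kernel's out-of-bounds check
            c[tid] = a[tid] + b[tid]
    return c

def tile_add(a, b, c, gang_index, gang_size, block_dim):
    """Work done by one tile: the strided thread loop from the translation."""
    for x in range(gang_index, block_dim, gang_size):
        c[x] = a[x] + b[x]

def manycore_add(a, b, block_dim, gang_size):
    """Run every tile of the gang in turn (on hardware they would run in parallel)."""
    c = [0] * block_dim
    for gang_index in range(gang_size):
        tile_add(a, b, c, gang_index, gang_size, block_dim)
    return c

a = [1, 2, 3, 4, 5, 6]
b = [10, 20, 30, 40, 50, 60]
print(manycore_add(a, b, block_dim=6, gang_size=4) == cuda_add(a, b, block_dim=6, n=6))  # True
```

Because tile `g` handles indices `g, g + gang_size, g + 2 * gang_size, ...`, every logical thread index below `block_dim` is visited exactly once, even when the gang is smaller than the thread block. The translated loop as shown on the slide has no bounds check against `N`, so this sketch assumes the arrays hold at least `block_dim` elements.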

SLIDE 32

High Bandwidth Memory (HBM2) Analysis

More bandwidth, more parallelism and lower energy per bit

  • Stacked DRAM dies connected by TSVs
  • 4 dies per HBM device
  • 2 devices per HammerBlade package
  • Interposer connects SoC and HBM dies
  • Shorter wires, higher density vs DDR4 PCB (low power)
  • Each die has 4 “semi-independent” channels
  • 16x bandwidth vs DDR4 (4 dies * 4 channels)
  • Interface is wide/slow (low power)
  • DDR4 is narrow/fast (high power)

Figure labels: HBM2 Device (Stack); HammerBlade Device (HBM, HBM, Interposer)
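The channel arithmetic behind the 16x figure can be written out directly. The counts come from the slide; reading "16x" as "16 channels, each roughly comparable to one DDR4 channel" is an assumption, and the per-package total is derived here rather than stated on the slide.

```python
DIES_PER_DEVICE = 4       # from the slide
CHANNELS_PER_DIE = 4      # from the slide
DEVICES_PER_PACKAGE = 2   # from the slide

channels_per_device = DIES_PER_DEVICE * CHANNELS_PER_DIE
channels_per_package = channels_per_device * DEVICES_PER_PACKAGE

def relative_bandwidth(channels, per_channel_vs_ddr4=1.0):
    """Bandwidth relative to one DDR4 channel; the per-channel ratio is an assumption."""
    return channels * per_channel_vs_ddr4

print(channels_per_device)                       # 16
print(channels_per_package)                      # 32
print(relative_bandwidth(channels_per_device))   # 16.0
```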

SLIDE 33

Emulation on Amazon F1 FPGA Cloud

  • HammerBlade Emulation on Amazon F1 FPGA Cloud
  • Architecture
  • Synthesize HB RTL to FPGA
  • HBM2 is emulated with on-board DRAM
  • High-bandwidth trace goes off board to Intel Xeon disk (~16 GB/sec)
  • Lower-bandwidth I/O, controlled by C on Xeon
  • Write code and initialization data
  • Limited interactive debug
  • Software users can run our hardware at OS-capable speeds
  • No need for them to buy FPGA boards and tools, or license HW tools
  • Challenges
  • $1.65/hr is cheap unless you forget it’s running for a week!
  • FPGA compile times are long, so simulation remains the faster way to iterate.
  • Main benefits are big workloads, scalability, and usability for users who are not CAD-tool savvy
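To put the $1.65/hr remark in numbers, here is a small cost calculation. The hourly rate is the one quoted on the slide; the function name and the example durations are illustrative.

```python
HOURLY_RATE_USD = 1.65   # rate quoted on the slide

def instance_cost(hours, rate=HOURLY_RATE_USD):
    """Cost in USD of keeping one instance running for the given number of hours."""
    return round(hours * rate, 2)

print(instance_cost(8))        # one working day: 13.2
print(instance_cost(24 * 7))   # left running for a week: 277.2
```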

SLIDE 34

Thank You