SLIDE 1

Software, Architecture, and VLSI Co-Design for Efficient Task-Based Parallel Runtimes

Christopher Torng

Computer Systems Laboratory School of Electrical and Computer Engineering Cornell University

SLIDE 2
  • Motivation •

Task-Based Parallelism Voltage Regulation Rapid ASIC Design Future Research

Emerging New Contexts Demand Better Hardware

Pushing Intelligence to the Edge

- Better local security
- Faster response times
- Lower data-movement energy
- Many more...

Source: Lanner

[Figure: peak performance vs. energy efficiency. TI MSP430 on a CR2032 coin cell: 10+ years in standby mode; inference on one 28 x 28 image (MNIST dataset): 2.5 sec]

Source: Gobieski ASPLOS'19

Cornell University Christopher Torng 2 / 56

SLIDE 3

Emerging New Contexts Demand Better Hardware

Machine Learning Graph Analytics

- Cybersecurity
- Smart Healthcare
- Smart Home
- Augmented Reality
- Virtual Reality
- Autonomous Driving

How can we drastically improve performance and energy efficiency for these new emerging contexts?

SLIDE 4

Motivating Trends in Computer Architecture

[Figure: microprocessor trends from 1975 to 2015 on a log scale. Series: Transistors (Thousands), SPECint Performance, Frequency (MHz), Typical Power (W), Number of Cores. Annotated growth rates: ~9%/year and ~15%/year. Labeled chips: MIPS R2K, DEC Alpha 21264, Intel P4, AMD 4-Core Opteron, Intel 48-Core Prototype.]

Data collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten

SLIDE 5

Excitement After Moore’s Law

[Figure: the computing stack with emerging technologies and applications around it]

Stack layers: Application, Algorithm, Programming Language, Operating System, Compiler, Instruction Set Architecture, Microarchitecture, Register-Transfer Level, Gate Level, Circuits, Devices, Technology

Emerging technologies: Carbon Nanotubes, Quantum Computing, Molecular Computing, Energy Harvesting, Biodegradable Computing, Phase-Change Memory

Emerging applications: AI, Smart Healthcare, Smart Home, Graph Analytics, Cybersecurity, Autonomous Driving, AR / VR

Computer Architecture

SLIDE 6

Motivation

  • Task-Based Parallelism •

Voltage Regulation Rapid ASIC Design Future Research

Building Future Computing Systems that Bridge Software, Architecture, and VLSI

- Cross-Stack Co-Design for Task-Based Parallel Runtimes (ISCA’16, MICRO’17, RISCV’18)
- Cross-Stack Co-Design for Integrated Voltage Regulation (MICRO’14, IEEE TCAS I’18)
- Cross-Stack Co-Design for Rapid ASIC Design (IEEE MICRO’18, DAC’18, Hotchips’17)
- Future Research

SLIDE 8

Cross-Stack Co-Design for Task-Based Parallelism

Work-Stealing Runtimes

Single-ISA Heterogeneous Architectures Dynamic Voltage and Frequency Scaling

Dynamic Asymmetry Static Asymmetry

How can we use asymmetry awareness to improve the performance and energy efficiency of a work-stealing runtime?

SLIDE 9

Work-Stealing Runtimes

[Figure: work in progress and task queues on Core 0 to Core 3, holding Tasks C, D, E, and F; two steals are shown (Steal Task E, Steal Task F)]

- Work stealing has good performance, space requirements, and communication overheads in both theory and practice
- Supported in many popular concurrency platforms, including Intel’s Cilk Plus, Intel’s C++ TBB, Microsoft’s .NET Task Parallel Library, Java’s Fork/Join Framework, and OpenMP
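To make the stealing behavior concrete, here is a minimal single-threaded sketch in Python. It is a toy model, not the runtime from the talk: the `Worker` class, the round-robin `run` loop, and the first-found victim order are illustrative assumptions. The owner works from the tail of its own queue, and an idle core steals from the head of another core's queue.

```python
from collections import deque


class Worker:
    """One core's task queue in a toy work-stealing scheduler (hypothetical names)."""

    def __init__(self, name):
        self.name = name
        self.tasks = deque()

    def push(self, task):
        # The owner adds new work at the tail of its own queue.
        self.tasks.append(task)

    def pop(self):
        # The owner takes the newest task from the tail.
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victim):
        # A thief takes the oldest task from the head of the victim's queue.
        return victim.tasks.popleft() if victim.tasks else None


def run(workers):
    """Round-robin simulation; returns (worker, task, how) in execution order."""
    log = []
    while any(w.tasks for w in workers):
        for w in workers:
            task, how = w.pop(), "own"
            if task is None:
                for victim in workers:
                    if victim is not w:
                        task = w.steal_from(victim)
                        if task is not None:
                            how = "stolen"
                            break
            if task is not None:
                log.append((w.name, task, how))
    return log


cores = [Worker(f"core{i}") for i in range(4)]
for t in ["C", "D", "E", "F"]:
    cores[0].push(t)          # all work starts on core 0
log = run(cores)
```

With all four tasks initially on one core, the other three cores each obtain their work by stealing, which is the load-balancing effect the slide illustrates.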

SLIDE 10

Static Asymmetry vs. Dynamic Asymmetry

Static asymmetry: Samsung Exynos Octa Mobile Processor

[Figure: four big ARM cores (A15) and four little ARM cores (A7), with two L2$ blocks]

Dynamic asymmetry: Integrated Voltage Regulation

[Figure: voltage (V) vs. time (ns), 0.7 V to 1.4 V over 100 ns to 400 ns, annotated with 120 ns and 150 ns]

[Figure: energy vs. performance operating points: Fmin @ Vmin, Fnom @ Vnom, Fmax @ Vmax]

SLIDE 11

Work-Stealing Runtimes Dynamic Asymmetry Static Asymmetry

Single-ISA Heterogeneous Architectures Dynamic Voltage and Frequency Scaling

- Bender et al. "Online Scheduling of Parallel Programs on Heterogeneous Sys ..." Theory of Computing Systems 2002
- Ribic et al. "Energy-Efficient Work-Stealing Language Runtimes" ASPLOS 2014
- Azizi et al. "Energy-performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis" ISCA 2010

How can we use asymmetry awareness to improve the performance and energy efficiency of a work-stealing runtime?

SLIDE 12

L L B B

Work-Stealing Runtimes Dynamic Asymmetry Static Asymmetry

Work-Pacing Work-Sprinting Work-Mugging

Cross-Stack Co-Design for Task-Based Parallelism

Let’s start with some first-order modeling to build intuition

SLIDE 13

Building Intuition by Exploring a 1 Big 1 Little System

[Figure: normalized power vs. normalized instructions per second (IPS) for a system with 1 big and 1 little core. The little core sits at (IPS 1.0, power 1.0) and the four-issue big core at (IPS 2.0, power 6.0), so the system totals 3.0 IPS at 7.0 power.]

Same Power: 10% Performance Increase, 10% Energy Efficiency Increase

SLIDE 14

The Law of Equi-Marginal Utility

Alfred Marshall (1842 - 1924), British Economist

"Other things being equal, a consumer gets maximum satisfaction when he allocates his limited income to the purchase of different goods in such a way that the Marginal Utility derived from the last unit of money spent on each item of expenditure tend to be equal."

Balance the ratio of utility (IPS) to cost (power)

[Figure: normalized power vs. normalized IPS, comparing the slope of the curve at 0.9 V with the slope at 1.3 V]

Arbitrage: "Buy Low, Sell High"

SLIDE 15

Systematic Approach for Balancing Marginal Utility

[Figure: normalized energy efficiency vs. normalized IPS with an isopower line. Marked: the 1 big 1 little system at nominal voltage, individual (VB, VL) pairs, and the Pareto-optimal frontier.]

Assumptions

- Perfectly parallel application
- Ideal load balancing

Marginal Utility-Based Optimization Problem

- Constraint: isopower line
- Objective: maximize performance
- (Solved numerically)
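The optimization problem above can be sketched with a brute-force search. This is only an illustration under stated assumptions, not the model used in the talk: frequency is taken as linear in voltage with a hypothetical offset `V_TH`, power is dynamic-only and scales as f * V^2, and the numerical solver is replaced by a grid search. The nominal operating points (big core: 2.0 IPS at 6.0 power; little core: 1.0 IPS at 1.0 power) come from the 1 big 1 little slide, and the 0.70 V to 1.30 V range matches the voltages shown later in the deck.

```python
V_TH = 0.3   # assumed offset for the linear f(V) model (not from the slides)

# Nominal (1.0 V) operating points read off the 1-big-1-little slide.
CORES = {
    "big":    {"ips": 2.0, "power": 6.0},
    "little": {"ips": 1.0, "power": 1.0},
}


def freq(v):
    """Normalized frequency: linear in voltage, equal to 1.0 at 1.0 V."""
    return (v - V_TH) / (1.0 - V_TH)


def ips(core, v):
    """Throughput scales with frequency."""
    return CORES[core]["ips"] * freq(v)


def power(core, v):
    """Dynamic power only: scales with f * V^2 from the nominal point."""
    return CORES[core]["power"] * freq(v) * v * v


def best_pair(budget, v_min=0.70, v_max=1.30, step=0.01):
    """Grid-search the (VB, VL) pair with the highest total IPS within the budget."""
    n = round((v_max - v_min) / step)
    grid = [round(v_min + i * step, 2) for i in range(n + 1)]
    best = None
    for vb in grid:
        for vl in grid:
            p = power("big", vb) + power("little", vl)
            if p > budget + 1e-9:
                continue  # violates the isopower constraint
            total = ips("big", vb) + ips("little", vl)
            if best is None or total > best[0]:
                best = (total, vb, vl, p)
    return best


nominal_power = power("big", 1.0) + power("little", 1.0)
total_ips, vb, vl, p = best_pair(nominal_power)
print(f"VB={vb:.2f} V, VL={vl:.2f} V, IPS={total_ips:.3f}, power={p:.3f}")
```

Under these assumptions the search lowers the big core's voltage below nominal and raises the little core's above it, because at nominal voltage the little core returns more IPS per unit of added power. That is the "buy low, sell high" arbitrage from the previous slide.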

SLIDE 16

L L B B

Work-Stealing Runtimes Dynamic Asymmetry Static Asymmetry

Work-Pacing Work-Sprinting Work-Mugging

Cross-Stack Co-Design for Task-Based Parallelism

Let’s explore three specific techniques to balance marginal utility in a work-stealing runtime

SLIDE 17

Work-Pacing: Building Intuition

Balance performance/power across cores in the high-parallel (HP) region

[Figure: 2 big, 2 little system with both big cores and both little cores active (busy, none in the steal loop). Panels: normalized power vs. normalized IPS for the big (B) and little (L) cores; marginal IPS/W plotted against paired voltages, with VL from 0.70 to 1.90 and VB from 1.04 to 0.24; aggregate system IPS (aggregate throughput) over the same (VL, VB) pairs.]

SLIDE 18

Work-Pacing, Work-Sprinting, and Work-Mugging

L L B B

[Legend: Steal Loop, Busy]

Work-Pacing

- Balance performance/power across cores in the high-parallel (HP) region

Work-Sprinting

- Rest cores in the steal loop to the lowest voltage
- With additional power slack, balance performance/power across busy cores in the low-parallel (LP) region

Work-Mugging

- Move work from slow little cores to fast big cores in the low-parallel (LP) region
- Inspired by theoretical work: Bender et al. Theory of Computing Systems '02

SLIDE 19

L L B B

Work-Stealing Runtimes Dynamic Asymmetry Static Asymmetry

Work-Pacing Work-Sprinting Work-Mugging

Cross-Stack Co-Design for Task-Based Parallelism

We have three techniques for balancing marginal utility (but we’re missing something)

SLIDE 20

Augmenting the Software/Architecture Interface

[Figure: big (B) and little (L) cores marked as busy or in the steal loop, shown against the computing stack (Application, Algorithm, Programming Language, Operating System, Compiler, Instruction Set Architecture, Microarchitecture, Register-Transfer Level, Gate Level, Circuits, Devices, Technology)]

Annotate the work-stealing runtime with hints

SLIDE 21

Popping Back Up a Level

[Figure: the computing stack with emerging technologies and applications, as on slide 5]

Computer Architecture

Task-Based Parallel Runtimes Integrated Voltage Regulation

SLIDE 22

L L B B

Work-Stealing Runtimes Dynamic Asymmetry Static Asymmetry

Work-Pacing Work-Sprinting Work-Mugging

Cross-Stack Co-Design for Task-Based Parallelism

For the detail-minded, here are the specific mechanisms

SLIDE 23

Work-Pacing and Work-Sprinting Mechanisms

[Figure: system block diagram. Two big (B) and two little (L) cores, each with an L1$, connect through an on-chip interconnect to shared L2$ cache banks and a DRAM memory controller. Voltage regulators are driven by a DVFS controller. Task queues with work in progress hold Tasks A, B, and C.]

Hints from the runtime: which cores are stealing? Big or little? The DVFS controller turns these hints into a Vdd decision.

[Table: activity pattern (A = Active, S = Stealing, one entry per core) mapped to voltages VB and VL. Legible rows: with all four cores active, VB = 0.91 V and VL = 1.30 V; with three active and one stealing, the voltages are 0.98 V, 1.30 V, and 0.70 V. The remaining rows list voltages between 0.70 V and 1.30 V and could not be recovered reliably from the extraction.]
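The lookup from activity pattern to voltages can be sketched as a small table-driven function. This is a hedged illustration, not the controller from the talk: the names `LOOKUP`, `decide`, `V_REST`, and `V_NOMINAL` are hypothetical, only the two table rows that are legible on the slide are filled in (reading "A A A s" as two active big cores, one active little core, and one stealing little core, which is an interpretation), and the fallback to 1.00 V for other patterns is an assumption.

```python
V_REST = 0.70      # lowest voltage on the slides; used here for cores in the steal loop
V_NOMINAL = 1.00   # assumed fallback for patterns not filled in below

# (active big cores, active little cores) -> (V_big, V_little)
# Only the two rows legible on the slide are included.
LOOKUP = {
    (2, 2): (0.91, 1.30),
    (2, 1): (0.98, 1.30),
}


def decide(cores):
    """cores: list of (kind, state), kind in {'B', 'L'}, state in {'A', 'S'}."""
    n_big = sum(1 for kind, state in cores if kind == "B" and state == "A")
    n_little = sum(1 for kind, state in cores if kind == "L" and state == "A")
    v_big, v_little = LOOKUP.get((n_big, n_little), (V_NOMINAL, V_NOMINAL))
    voltages = []
    for kind, state in cores:
        if state == "S":
            voltages.append(V_REST)          # rest stealing cores
        else:
            voltages.append(v_big if kind == "B" else v_little)
    return voltages


all_active = decide([("B", "A"), ("B", "A"), ("L", "A"), ("L", "A")])
one_stealing = decide([("B", "A"), ("B", "A"), ("L", "A"), ("L", "S")])
```

Keying the table on counts of active cores rather than on the full per-core pattern keeps it small; whether the real controller does the same is not stated on the slide.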

SLIDE 24

Work-Mugging Mechanisms

[Figure: system block diagram with two big (B) and two little (L) cores, L1$ per core, on-chip interconnect, shared L2$ cache banks, DRAM memory controller, voltage regulators, DVFS controller, and a user-level interrupt network. Sequence shown: Initiate Mug, Mug Interrupt, Thread Context Swap.]

Mug Instruction

- Thread ID to mug
- Address of thread-swapping handler

User-Level Interrupt Network

- Simple, low-bandwidth inter-core network
- Latency on order of 20 cycles

Thread Context Swap

- Threads store architectural state to separate locations in shared memory
- Both threads sync
- Threads load architectural state from other location
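The three context-swap steps can be mimicked in software with two Python threads and a barrier. This is only an analogy under stated assumptions: a dictionary stands in for shared memory, a small dict stands in for architectural state, and `mug_swap` is a hypothetical name. The real mechanism is a hardware-assisted handler triggered over the user-level interrupt network.

```python
import threading


def mug_swap(state_big, state_little):
    """Swap two threads' state through shared storage (toy model of a mug)."""
    shared = {}                       # stands in for shared memory
    sync = threading.Barrier(2)       # "both threads sync"
    result = {}

    def swap(me, other, state):
        shared[me] = dict(state)      # 1. store own state to a separate location
        sync.wait()                   # 2. wait until both stores have happened
        result[me] = shared[other]    # 3. load state from the other location

    t_big = threading.Thread(target=swap, args=("big", "little", state_big))
    t_little = threading.Thread(target=swap, args=("little", "big", state_little))
    t_big.start()
    t_little.start()
    t_big.join()
    t_little.join()
    return result["big"], result["little"]


# The idle big core takes over Task C from the busy little core.
big_now, little_now = mug_swap(
    {"pc": 0x100, "task": "steal loop"},
    {"pc": 0x200, "task": "C"},
)
```

The barrier is what makes the swap safe: neither side loads until both sides have finished storing.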

SLIDE 25

L L B B

Work-Stealing Runtimes Dynamic Asymmetry Static Asymmetry

Work-Pacing Work-Sprinting Work-Mugging

Cross-Stack Co-Design for Task-Based Parallelism

Let’s see an asymmetry-aware work-stealing runtime in action

SLIDE 26

Evaluation Methodology: Modeling

Work-Stealing Runtime

- State-of-the-art Intel TBB-inspired work-stealing scheduler
- Chase-Lev task queues with occupancy-based victim selection
- Instrumented with activity hints

Cycle-Level Modeling

- Heterogeneous system modeled in gem5 cycle-approximate simulator
- Support for scaling per-core frequencies + central DVFS Controller

Energy Modeling

- Event-based energy modeling based on detailed RTL/gate-level sims (Synopsys ASIC toolflow, TSMC LP, 65 nm 1.0 V)
- Carefully selected subset of McPAT results tuned to our µarchitecture
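The occupancy-based victim selection mentioned under Work-Stealing Runtime can be illustrated as follows. The sketch assumes that "occupancy-based" means the thief targets the fullest queue, and it uses plain `collections.deque` objects rather than lock-free Chase-Lev deques; `pick_victim` and the lowest-index tie-break are hypothetical choices.

```python
from collections import deque


def pick_victim(queues, thief):
    """Return the index of the fullest non-empty queue other than the thief's, or None."""
    candidates = [(len(q), -i) for i, q in enumerate(queues) if i != thief and q]
    if not candidates:
        return None
    _, neg_index = max(candidates)   # most tasks first; ties go to the lowest index
    return -neg_index


queues = [deque(), deque(["A"]), deque(["B", "C", "D"]), deque(["E", "F"])]
victim = pick_victim(queues, thief=0)
stolen = queues[victim].popleft()    # steal the oldest task from the victim's head
```

Compared with picking a random victim, targeting the fullest queue reduces the chance of a failed steal attempt, at the cost of reading every queue's occupancy.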

SLIDE 27

Work-Pacing in cilk-sort

[Figure: activity bars (busy vs. steal loop) and DVFS controller decisions (0.70 V, 0.90 V, 1.00 V, 1.04 V, 1.24 V, 1.30 V) over time for the big and little cores, without AAWS techniques and with work-pacing. Speedup of 1.11x.]

[Figure: normalized energy efficiency vs. performance, with isopower line]

Speedup 1.11x, Energy Efficiency 1.11x

SLIDE 28

Work-Sprinting in quicksort

[Figure: activity bars (busy vs. steal loop) and DVFS controller decisions (0.70 V, 1.00 V, 1.04 V, 1.10 V, 1.24 V, 1.30 V) over time for the big and little cores, without AAWS techniques and with work-sprinting. Speedup of 1.34x.]

[Figure: normalized energy efficiency vs. performance, with isopower line]

Speedup 1.34x, Energy Efficiency 1.16x

SLIDE 29

Work-Mugging in radix sort

[Figure: activity bars (busy vs. steal loop) and DVFS controller decisions (0.70 V, 0.90 V, 1.00 V, 1.04 V, 1.24 V, 1.30 V) over time for the big and little cores, without AAWS techniques and with work-mugging. Speedup 1.17x.]

[Figure: normalized energy efficiency vs. performance, with isopower line]

Speedup 1.17x, Energy Efficiency 1.40x
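The speedup and energy-efficiency figures on the three case-study slides also imply how average power changed. Assuming energy efficiency means work per unit energy (so efficiency = performance / power), the power relative to the baseline run is speedup divided by efficiency gain. The helper name `relative_power` is hypothetical; the numbers are the ones reported on the slides.

```python
def relative_power(speedup, efficiency_gain):
    """Average power vs. baseline, assuming efficiency = performance / power."""
    return speedup / efficiency_gain


results = {
    "work-pacing (cilk-sort)":    (1.11, 1.11),
    "work-sprinting (quicksort)": (1.34, 1.16),
    "work-mugging (radix sort)":  (1.17, 1.40),
}

for name, (speedup, efficiency) in results.items():
    print(f"{name}: {relative_power(speedup, efficiency):.2f}x baseline power")
# work-pacing (cilk-sort): 1.00x baseline power
# work-sprinting (quicksort): 1.16x baseline power
# work-mugging (radix sort): 0.84x baseline power
```

Read this way, work-pacing sits on the isopower line, work-sprinting spends extra power to go faster, and work-mugging is both faster and lower power in these examples.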

SLIDE 30

Evaluation of Complete AAWS Runtime

[Figure: normalized energy efficiency vs. performance for each application kernel, with isopower line]

Application Kernels

- pbbs-bfs, pbbs-quicksort, pbbs-samplesort, pbbs-dictionary, pbbs-convex-hull, pbbs-radix-sort, pbbs-knn, pbbs-max-independent-set, pbbs-nbody, pbbs-remove-duplicates, pbbs-suffix-array, pbbs-spanning-tree
- cilk-cholesky, cilk-cilksort, cilk-heat, cilk-knapsack, cilk-matrix-multiply
- parsec-blackscholes
- unbalanced-tree-search

Performance: median 1.10x, max 1.32x

Energy Efficiency: median 1.11x, max 1.53x

SLIDE 31

Popping Back Up a Level

[Figure: the computing stack with emerging technologies and applications, as on slide 5]

Computer Architecture

Task-Based Parallel Runtimes Integrated Voltage Regulation

1 bit

SLIDE 32

Prototyping to Support Research Results

[Figure: annotated die layout, 1.25 mm x 1.0 mm. Labeled blocks: four cores, I$ tag and data arrays, D$ tag and data arrays, L0, shared MDU, shared FPU, Bloom filter accelerator, PLL]

Batten Research Group Test Chip 2: Digital Test Chip, TSMC 28 nm, Static Timing Analysis @ 500 MHz

- Completed in 2 months
- Runs a work-stealing runtime (RISC-V XCC)
- Four RISC-V cores + 32kB L1 caches
- Aggressively shared long-latency resources
- Microarchitectural smart sharing mechanisms
- Synthesizable PLL

Results supporting multiple paper submissions to top-tier venues

SLIDE 33

Motivation Task-Based Parallelism

  • Voltage Regulation •

Rapid ASIC Design Future Research

Building Future Computing Systems that Bridge Software, Architecture, and VLSI

- Cross-Stack Co-Design for Task-Based Parallel Runtimes (ISCA’16, MICRO’17, RISCV’18)
- Cross-Stack Co-Design for Integrated Voltage Regulation (MICRO’14, IEEE TCAS I’18)
- Cross-Stack Co-Design for Rapid ASIC Design (IEEE MICRO’18, DAC’18, Hotchips’17)
- Future Research

SLIDE 34

What is a Voltage Regulator?

[Figure: a voltage regulator on the board, labeled with 1.8 V and 0.9 V, supplies a chip containing Core 0 to Core 3, four cache banks, and an on-chip interconnect]

SLIDE 35

Why is Integrated Voltage Regulation Important?

[Figure: chip with Core 0 to Core 3, four cache banks, and an on-chip interconnect, contrasting off-chip discrete voltage regulators with on-chip integrated voltage regulators]

Key Benefit of IVR

- Reduced System Cost

Why Integrate Now?

- Technology scaling: on-chip switches and capacitors have gotten better

What’s the Problem?

- Integrated voltage regulators are BIG

[Figure: a technology-normalized integrated voltage regulator next to a technology-normalized simple RISC core]

SLIDE 36

What’s taking all that area?

[Figure: IVR test chip layout with labels Vin, Load, Cap, Cap]

SLIDE 37

Key Architecture-Level Intuition

[Figure: four cores, each drawn with four capacitor cells (C). Operating points shown: Sprint at 1.5 V, 20 mA; Nominal at 1.0 V, 10 mA; Rest.]

SLIDE 38

Key Idea: Dynamic Capacitance Sharing

[Figure: unit cells and a Vout control loop for Core 0 and Core 1, connected through a DCS switch fabric under DCS fabric control; annotations A, B, C, D]

SLIDE 39

Evaluation

[Figure: normalized energy efficiency vs. speedup with isopower line, comparing No DVFS; DVFS, no sharing; and DVFS, with DCS. Annotation: DCS recovers speedup?]

Benchmarks

- bfs, bilateral, dither, kmeans, mriq, pbbs-dr, pbbs-knn, pbbs-mm, rsort, splash2-fft, splash2-lu, strsearch, viterbi

SLIDE 40

SPICE-Level Transient Response

[Figure: SPICE-level transient responses, voltage (V) vs. time, for DVFS, no sharing and DVFS, with DCS. One plot spans 100 ns to 400 ns with 120 ns and 150 ns annotated; the other spans 0.25 us to 3.25 us with 960 ns, 1390 ns, and 2900 ns annotated.]

SLIDE 41

Evaluation

[Figure: normalized energy efficiency vs. speedup with isopower line, comparing No DVFS and DVFS, with DCS]

10-50% Speedup and 10-70% Energy Efficiency with Area-Optimized On-Chip VRs

Benchmarks

- bfs, bilateral, dither, kmeans, mriq, pbbs-dr, pbbs-knn, pbbs-mm, rsort, splash2-fft, splash2-lu, strsearch, viterbi

SLIDE 42

Popping Back Up a Level

[Figure: the computing stack with emerging technologies and applications, as on slide 5]

Computer Architecture 1 bit

Simple Parallel Runtimes Integrated Voltage Regulation

SLIDE 43

Dynamic Capacitance Sharing Analog Test Chip

[Figure: annotated die photo, 2.2 mm x 1.0 mm, showing control, loads, and clusters]

Four monolithically integrated switched-capacitor DC-DC converters with the dynamic capacitance sharing technique in 65 nm CMOS

Collaboration with Waclaw Godycki, Ivan Bukreyev, and Professor Alyssa Apsel

[ MICRO 2014 ] [ IEEE TCAS I 2018 ]

SLIDE 44

Motivation Task-Based Parallelism Voltage Regulation

  • Rapid ASIC Design •

Future Research

Building Future Computing Systems that Bridge Software, Architecture, and VLSI

- Cross-Stack Co-Design for Task-Based Parallel Runtimes (ISCA’16, MICRO’17, RISCV’18)
- Cross-Stack Co-Design for Integrated Voltage Regulation (MICRO’14, IEEE TCAS I’18)
- Cross-Stack Co-Design for Rapid ASIC Design (IEEE MICRO’18, DAC’18, Hotchips’17)
- Future Research

SLIDE 45

Challenges in Building ASIC Prototypes

[Figure: ASIC design flow. Design and simulation: high-level design-space exploration, RTL design & simulation, post-synthesis gate-level simulation, post-place-and-route gate-level simulation. Highly automated standard-cell flow: synthesis, floorplanning, power routing, placement, clock tree synthesis, routing, DRC, RCX, LVS, power analysis, transistor-level sim. Tape out: 2 years later.]

Costly in terms of...

- $$ to fabricate
- $$ for licensing of IP
- $$ for expertise and workforce

Time and effort...

- RTL design and verification
- ASIC frontend / ASIC backend
  - Weeks to months for one iteration
- 1.5 year timeline is too late!

By the end, the frontier of the accelerator design space will have already moved on...

SLIDE 46

PyMTL ASIC Tapeouts

BRGTC1 in 2016

- RISC processor, 16KB SRAM
- HLS-generated accelerator
- 2x2mm, 1.2M-trans, IBM 130nm
- [ Poster at Hotchips 2016 ]

BRGTC2 in 2018

- Four RISC-V cores with “smart” sharing
- L1$/LLFU, PLL
- 1x1.2mm, ≈10M-trans, TSMC 28nm
- [ RISCV 2018 ]

[Figure: BRGTC2 die layout, 1.25 mm x 1.0 mm, as on slide 32]

SLIDE 47

The Celerity SoC

Target Workload: High-Performance Embedded Computing

- Multiple universities under DARPA CRAFT
- 5 × 5mm in TSMC 16 nm FFC
- 385 million transistors
- 511 RISC-V cores
  - 5 Linux-capable Rocket cores
  - 496-core tiled manycore
  - 10-core low-voltage array
- 1 BNN accelerator
- 1 synthesizable PLL
- 1 synthesizable LDO Vreg
- 3 clock domains
- 672-pin flip chip BGA package
- 9 months from PDK access to tape-out

[ Hotchips 2017 ] [ IEEE MICRO 2018 ]

SLIDE 48

High-Productivity SoC Design Flow based on HLS

[Figure: SoC block diagram. A RISC-V Rocket core, I/O, and global memory (control, Bank0 to BankN, crossbar) connect over an AXI bus to a spatial computation array of 16 PER tiles. Each processing element contains control, datapath, scratchpad, and a router interface. Other labels: SerDes, LI channels, MatchLib.]

Internship at NVIDIA Research in Summer’17, led by Brucek Khailany

Lightly involved in their MatchLib project under DARPA CRAFT [ DAC 2018 ]

SLIDE 49

Motivation Task-Based Parallelism Voltage Regulation Rapid ASIC Design

  • Future Research •

Building Future Computing Systems that Bridge Software, Architecture, and VLSI

- Cross-Stack Co-Design for Task-Based Parallel Runtimes (ISCA’16, MICRO’17, RISCV’18)
- Cross-Stack Co-Design for Integrated Voltage Regulation (MICRO’14, IEEE TCAS I’18)
- Cross-Stack Co-Design for Rapid ASIC Design (IEEE MICRO’18, DAC’18, Hotchips’17)
- Future Research

SLIDE 50

Future Research

Apply a cross-stack research approach to many different problems using a vertically integrated methodology

- Intelligence on the Edge
- Tiling-Based Designs
- Cyber-Physical Systems

SLIDE 51

Cross-Stack Co-Design for IoT on the Edge

[Figure: critical care example (managing blood transfusion, crystalloids, vasopressors) subject to real-time constraints and the HIPAA Privacy Rule. Not enough performance! Specialize.]

Software-Hardware Co-Design

- Software design: quantization, domain hints, security hints
- Hardware design: accelerators, compressed comm, secure hardware, fine-grain power

Summary

- Intelligence on the edge is required in many different application domains
- We need specialization
- Concretely: Build new accelerator-centric SoCs to enable new applications
- Pipe dreams
  - Emerging applications: smart healthcare + infra
  - Emerging technologies: hybrid CMOS-TFET, emerging memories
- Modular system design

SLIDE 52

Cross-Stack Co-Design for Tiling Designs

[Figure: tiles A, B, and C, with design effort and NREs on a scale of hours, days, and weeks]

- We need to build more hardware and make hardware easier to build
- Do our best to avoid monolithic designs
  - GALS – pre-silicon
  - Chiplets – post-silicon
- Concretely: Build new accelerator-centric SoCs with a methodology that extends the tiling abstraction across the stack

SLIDE 53

Cross-Stack Co-Design for Tiling Designs

[Figure: three views of tiles A, B, C1, and C2: constant propagation of a "0" input; constant propagation under modular design; and an RTL encoding of the tile abstraction (TileInterface) with generated constraints for the "0" input]

- We need to build more hardware and make hardware easier to build
- Do our best to avoid monolithic designs
  - GALS – pre-silicon
  - Chiplets – post-silicon
- Concretely: Build new accelerator-centric SoCs with a methodology that extends the tiling abstraction across the stack

SLIDE 54

Cross-Stack Co-Design with Cyber-Physical Systems

- Concretely: Explore new SoCs that can be embedded into cyber-physical systems
- Pipe dreams
  - Inspired by projects like Harvard RoboBee
  - Architectures + cyber-physical systems where custom acceleration and silicon prototyping can make a real difference

SLIDE 55

Acknowledgements and Funding

- Batten Research Group: Derek Lockhart, Ji Kim, Shreesha Srinath, Berkin Ilbeyi, Moyang Wang, Shunning Jiang, Khalid Al-Hawaj, Tuan Ta, Lin Cheng, Yanghui Ou, Peitian Pan, Christopher Batten
- Apsel Research Group: Waclaw Godycki, Ivan Bukreyev, Alyssa Apsel
- UCSD / University of Washington: Scott Davidson, Paul Gao, Atieh Lotfi, Julian Puscar, Loai Salem, Anuj Rao, Ningxiao Sun, Luis Vega, Bandhav Veluri, Xiaoyang Wang, Shaolin Xie, Chun Zhao, Michael B. Taylor
- University of Michigan: Tutu Ajayi, Aporva Amarnath, Austin Rovinski, Ronald G. Dreslinski
- NVIDIA: Brucek Khailany, Evgeni Krimer, Rangharajan Venkatesan, Jason Clemons, Joel Emer, Matthew Fojtik, Alicia Klinefelter, Michael Pellauer, Nathaniel Pinckney, Yakun Sophia Shao, Shreesha Srinath, Sam (Likun) Xi, Yanqing Zhang, Brian Zimmer
- Celerity: Tutu Ajayi, Khalid Al-Hawaj, Aporva Amarnath, Steve Dai, Scott Davidson, Paul Gao, Gai Liu, Atieh Lotfi, Julian Puscar, Anuj Rao, Austin Rovinski, Loai Salem, Ningxiao Sun, Luis Vega, Bandhav Veluri, Xiaoyang Wang, Shaolin Xie, Chun Zhao, Ritchie Zhao, Christopher Batten, Ronald G. Dreslinski, Ian Galton, Rajesh K. Gupta, Patrick P. Mercier, Mani Srivastava, Michael B. Taylor, Zhiru Zhang
- BRGTC1/2: Shunning Jiang, Khalid Al-Hawaj, Ivan Bukreyev, Berkin Ilbeyi, Tuan Ta, Lin Cheng, Julian Puscar, Ian Galton, Moyang Wang, Bharath Sudheendra, Nagaraj Murali, Suren Jayasuriya, Shreesha Srinath, Taylor Pritchard, Robin Ying, Christopher Batten

SLIDE 56

Takeaway Points

Emerging new contexts demand much higher performance and energy efficiency

Cross-stack co-design can make a real impact

- More efficient task-based parallel runtimes
- Smaller area for integrated voltage regulation
- Better methodologies for rapid ASIC design

I am excited to explore future cross-stack research applied to intelligence on the edge, driving methodologies for tiling-based designs, and also supporting cyber-physical systems.
