Software, Architecture, and VLSI Co-Design for Efficient Task-Based Parallel Runtimes
Christopher Torng
Computer Systems Laboratory, School of Electrical and Computer Engineering, Cornell University
Outline: Motivation, Task-Based Parallelism, Voltage Regulation, Rapid ASIC Design, Future Research
Section: Motivation
Emerging New Contexts Demand Better Hardware
Pushing Intelligence to the Edge
- Better local security
- Faster response times
- Lower data-movement energy
- Many more...
Source: Lanner
[Figure: peak performance vs. energy efficiency. TI MSP430 on a CR2032 coin cell: 10+ years in standby mode; inference on a 28 x 28 image from the MNIST dataset takes 2.5 sec. Source: Gobieski ASPLOS'19]
Emerging New Contexts Demand Better Hardware
Machine Learning, Graph Analytics
- Cybersecurity
- Smart Healthcare
- Smart Home
- Augmented Reality
- Virtual Reality
- Autonomous Driving
How can we drastically improve performance and energy efficiency for these new emerging contexts?
Motivating Trends in Computer Architecture
[Figure: microprocessor trends, 1975 to 2015, log scale from 10^0 to 10^7: transistors (thousands), frequency (MHz), typical power (W), SPECint performance, and number of cores, with annotated growth rates of ~15%/year and ~9%/year. Labeled processors: MIPS R2K, DEC Alpha 21264, Intel P4, AMD 4-Core Opteron, Intel 48-Core Prototype. Data collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten]
Excitement After Moore’s Law
[Figure: the computing stack, with the region labeled "Computer Architecture" marked. Layers: Application, Algorithm, Programming Language, Compiler, Operating System, Instruction Set Architecture, Microarchitecture, Register-Transfer Level, Gate Level, Circuits, Devices, Technology. Emerging applications: AI, Smart Healthcare, Smart Home, Graph Analytics, Cybersecurity, Autonomous Driving, AR / VR. Emerging technologies: Carbon Nanotubes, Quantum Computing, Molecular Computing, Energy Harvesting, Biodegradable Computing, Phase-Change Memory]
Section: Task-Based Parallelism
Building Future Computing Systems that Bridge Software, Architecture, and VLSI
- Cross-Stack Co-Design for Task-Based Parallel Runtimes: ISCA’16, MICRO’17, RISCV’18
- Cross-Stack Co-Design for Integrated Voltage Regulation: MICRO’14, IEEE TCAS I’18
- Cross-Stack Co-Design for Rapid ASIC Design: IEEE MICRO’18, DAC’18, Hotchips’17
- Future Research
Cross-Stack Co-Design for Task-Based Parallelism
[Figure: work-stealing runtimes combined with static asymmetry (single-ISA heterogeneous architectures) and dynamic asymmetry (dynamic voltage and frequency scaling)]
How can we use asymmetry awareness to improve the performance and energy efficiency of a work-stealing runtime?
Work-Stealing Runtimes
[Figure: four cores (Core 0 to Core 3), each with work in progress and a task queue holding Tasks C to F; Task E and Task F are being stolen]
- Work stealing has good performance, space requirements, and communication overheads in both theory and practice
- Supported in many popular concurrency platforms including: Intel’s Cilk Plus, Intel’s C++ TBB, Microsoft’s .NET Task Parallel Library, Java’s Fork/Join Framework, and OpenMP
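The steal loop pictured on this slide can be sketched as a small single-threaded simulation. This is not the code of any platform listed above: `Worker`, `run`, and the steal-from-the-fullest-queue policy are illustrative assumptions, and a real runtime would use concurrent deques.

```python
from collections import deque


class Worker:
    """One core's task queue (hypothetical helper, not from a real runtime)."""

    def __init__(self, wid):
        self.wid = wid
        self.queue = deque()

    def push(self, task):
        self.queue.append(task)  # the owner pushes at the tail

    def pop(self):
        return self.queue.pop() if self.queue else None  # the owner pops at the tail

    def steal_from(self, victim):
        # A thief takes the oldest task from the head of the victim's queue.
        return victim.queue.popleft() if victim.queue else None


def run(workers):
    """Round-robin simulation of the steal loop; returns (results, steal count)."""
    done, steals = [], 0
    while any(w.queue for w in workers):
        for w in workers:
            task = w.pop()
            if task is None:
                # Assumed policy: steal from the fullest queue.
                victim = max((v for v in workers if v is not w),
                             key=lambda v: len(v.queue))
                task = w.steal_from(victim)
                if task is not None:
                    steals += 1
            if task is not None:
                done.append((w.wid, task()))
    return done, steals


workers = [Worker(i) for i in range(4)]
for n in range(8):
    workers[0].push(lambda n=n: n * n)  # all work starts on core 0
done, steals = run(workers)
print(len(done), "tasks finished,", steals, "of them stolen")
```

All eight tasks start on core 0, so every task run by cores 1 to 3 in this sketch is a stolen one.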
Static Asymmetry vs. Dynamic Asymmetry
[Figure, static asymmetry: Samsung Exynos Octa mobile processor with four big ARM cores (A15) and four little ARM cores (A7), each cluster with an L2 cache]
[Figure, dynamic asymmetry: integrated voltage regulation. Voltage (V) vs. time (ns) transient response with annotated transition times of 120 ns and 150 ns; energy vs. performance curve with operating points Fmin @ Vmin, Fnom @ Vnom, and Fmax @ Vmax]
[Figure: work-stealing runtimes, static asymmetry (single-ISA heterogeneous architectures), and dynamic asymmetry (dynamic voltage and frequency scaling), annotated with related work]
- Bender et al. "Online Scheduling of Parallel Programs on Heterogeneous Sys ..." Theory of Computing Systems 2002
- Ribic et al. "Energy-Efficient Work-Stealing Language Runtimes" ASPLOS 2014
- Azizi et al. "Energy-performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis" ISCA 2010
How can we use asymmetry awareness to improve the performance and energy efficiency of a work-stealing runtime?
Cross-Stack Co-Design for Task-Based Parallelism
[Figure: work-stealing runtimes on two little (L) and two big (B) cores, combining static and dynamic asymmetry through work-pacing, work-sprinting, and work-mugging]
Let’s start with some first-order modeling to build intuition
Building Intuition by Exploring a 1 Big 1 Little System
[Figure: normalized power vs. normalized instructions per second (IPS) for a system with 1 big and 1 little core. The little core sits at 1.0 IPS and 1.0 power, the four-issue big core at 2.0 IPS and 6.0 power, for a system total of 3.0 IPS at 7.0 power]
Result at the same power: 10% performance increase, 10% energy efficiency increase
The Law of Equi-Marginal Utility
[Figure: normalized power vs. normalized instructions per second (IPS) curves, with the slope marked at 0.9 V and at 1.3 V]
"Other things being equal, a consumer gets maximum satisfaction when he allocates his limited income to the purchase of different goods in such a way that the Marginal Utility derived from the last unit of money spent on each item of expenditure tend to be equal."
Alfred Marshall (1842 - 1924), British Economist
Balance the ratio of utility (IPS) to cost (power)
Arbitrage: "Buy Low, Sell High"
Systematic Approach for Balancing Marginal Utility
[Figure: normalized energy efficiency vs. normalized IPS for a 1 big 1 little system, relative to the system at nominal voltage. Each point is an individual (VB, VL) pair; the isopower line and the Pareto-optimal frontier are marked]
Assumptions
- Perfectly parallel application
- Ideal load balancing
Marginal Utility-Based Optimization Problem
- Constraint: isopower line
- Objective: maximize performance
- (Solved numerically)
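A numerical solve of this kind can be sketched with a brute-force grid search. The nominal operating points (big core: 2.0 IPS at 6.0 power; little core: 1.0 IPS at 1.0 power) come from the 1 big 1 little slide. The scaling laws are assumptions for illustration only, IPS proportional to V and power proportional to V^3, and are not the talk's model, so the numbers are not expected to match the talk's results.

```python
V_MIN, V_MAX = 0.70, 1.30            # voltage range used elsewhere in the deck
BIG = {"ips": 2.0, "power": 6.0}     # big core at nominal voltage (1.0 V)
LITTLE = {"ips": 1.0, "power": 1.0}  # little core at nominal voltage (1.0 V)


def ips(core, v):
    return core["ips"] * v           # assumption: frequency and IPS scale linearly with V


def power(core, v):
    return core["power"] * v ** 3    # assumption: P ~ V^2 * f with f ~ V


def best_pair(budget, step=0.01):
    """Grid-search (VB, VL) for maximum total IPS with total power <= budget."""
    n_lo, n_hi = round(V_MIN / step), round(V_MAX / step)
    best = None
    for i in range(n_lo, n_hi + 1):
        for j in range(n_lo, n_hi + 1):
            vb, vl = i * step, j * step
            if power(BIG, vb) + power(LITTLE, vl) > budget:
                continue  # above the isopower line
            total = ips(BIG, vb) + ips(LITTLE, vl)
            if best is None or total > best[0]:
                best = (total, vb, vl)
    return best


nominal_ips = ips(BIG, 1.0) + ips(LITTLE, 1.0)   # 3.0
budget = power(BIG, 1.0) + power(LITTLE, 1.0)    # 7.0, the isopower constraint
total, vb, vl = best_pair(budget)
print(f"VB={vb:.2f} V, VL={vl:.2f} V, speedup={total / nominal_ips:.3f}x")
```

Under these toy assumptions the search lowers the big core's voltage and raises the little core's, which is the "buy low, sell high" trade: power is moved to the core that returns more IPS per unit of power.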
Cross-Stack Co-Design for Task-Based Parallelism
Let’s explore three specific techniques to balance marginal utility in a work-stealing runtime
Work-Pacing: Building Intuition
Balance performance/power across cores in the high-parallel (HP) region
[Figure: 2 big, 2 little system with both big cores and both little cores active (all busy, none in the steal loop). Left: normalized power vs. normalized IPS curves for the big (B) and little (L) cores. Right: marginal IPS/W of each core and aggregate system IPS (aggregate throughput) over paired voltages ranging from VL = 0.70, VB = 1.04 to VL = 1.90, VB = 0.24]
Work-Pacing, Work-Sprinting, and Work-Mugging
[Figure: two little (L) and two big (B) cores, each either busy or in the steal loop]
Work-Pacing
- Balance performance/power across cores in the high-parallel (HP) region
Work-Sprinting
- Rest cores in the steal loop to the lowest voltage
- With additional power slack, balance performance/power across busy cores in the low-parallel (LP) region
Work-Mugging
- Move work from slow little cores to fast big cores in the low-parallel (LP) region
- Inspired by theoretical work: Bender et al. Theory of Computing '02
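The work-mugging idea, moving work from busy little cores to idle big cores, can be sketched as a pairing function. The function name `plan_mugs`, the string encoding of core kinds and hints, and the first-come pairing order are hypothetical choices for illustration, not the runtime's actual policy.

```python
def plan_mugs(kinds, hints):
    """Pair each stealing big core with a busy little core (illustrative policy).

    kinds: 'B' or 'L' per core; hints: 'A' (active) or 's' (stealing) per core.
    Returns (big_core, little_core) index pairs: the big core takes over the
    little core's task, and the little core goes to the steal loop.
    """
    cores = list(enumerate(zip(kinds, hints)))
    idle_big = [i for i, (k, h) in cores if k == "B" and h == "s"]
    busy_little = [i for i, (k, h) in cores if k == "L" and h == "A"]
    return list(zip(idle_big, busy_little))


print(plan_mugs("LLBB", "AAss"))  # both big cores idle, both little cores busy
print(plan_mugs("LLBB", "AsAs"))  # one idle big core, one busy little core
```

When every big core is busy, or no little core has work, the function returns an empty list and no mug is issued.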
Cross-Stack Co-Design for Task-Based Parallelism
We have three techniques for balancing marginal utility (but we’re missing something)
Augmenting the Software/Architecture Interface
[Figure: two little (L) and two big (B) cores, busy or in the steal loop, shown against the computing stack (Application, Algorithm, Programming Language, Compiler, Operating System, Instruction Set Architecture, Microarchitecture, Register-Transfer Level, Gate Level, Circuits, Devices, Technology)]
Annotate the work-stealing runtime with hints
Popping Back Up a Level
[Figure: the computing stack with emerging applications and emerging technologies, highlighting Task-Based Parallel Runtimes and Integrated Voltage Regulation within computer architecture]
Cross-Stack Co-Design for Task-Based Parallelism
For the detail-minded, here are the specific mechanisms
Work-Pacing and Work-Sprinting Mechanisms
[Figure: system block diagram with two big (B) and two little (L) cores, each with an L1 cache, an on-chip interconnect, shared L2 cache banks, a DRAM memory controller, voltage regulators, and a DVFS controller]
[Figure: the runtime's task queues (work in progress on Big, Big, Little, Little) send hints to the DVFS controller: which cores are stealing, and are they big or little? The controller's decision sets each core's Vdd]
[Table: lookup from activity pattern (per core, A = Active, S = Stealing) to voltages (VB, VL). With all four cores active, VB = 0.91 V and VL = 1.30 V. Stealing cores rest at 0.70 V. Entries for active cores range from 0.91 V to 1.30 V]
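The hint-driven decision above can be sketched as a table lookup. Only the all-active row (VB = 0.91 V, VL = 1.30 V) and the 0.70 V resting voltage are read directly from the slide; the second row is a best-effort reading of the garbled table, and the 1.00 V fallback for unlisted patterns is a placeholder assumption. `decide_voltages` and `LOOKUP` are hypothetical names.

```python
V_REST = 0.70  # stealing cores rest at the lowest voltage (from the slide)
V_NOM = 1.00   # placeholder for activity patterns not listed below (assumption)

# (active big cores, active little cores) -> (VB, VL) applied to the active cores.
LOOKUP = {
    (2, 2): (0.91, 1.30),  # all four cores active (from the slide)
    (2, 1): (0.98, 1.30),  # one little core stealing (uncertain reading)
}


def decide_voltages(kinds, hints):
    """kinds: 'B' or 'L' per core; hints: 'A' (active) or 's' (stealing) per core."""
    pairs = list(zip(kinds, hints))
    n_big = sum(1 for k, h in pairs if k == "B" and h == "A")
    n_little = sum(1 for k, h in pairs if k == "L" and h == "A")
    v_big, v_little = LOOKUP.get((n_big, n_little), (V_NOM, V_NOM))
    return [V_REST if h == "s" else (v_big if k == "B" else v_little)
            for k, h in pairs]


print(decide_voltages("BLBL", "AAAA"))
print(decide_voltages("BLBL", "AAAs"))
```

Keying the table on counts of active big and little cores keeps it small; a controller that distinguishes individual cores would need one entry per full activity pattern.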
Work-Mugging Mechanisms
[Figure: system block diagram (big and little cores with L1 caches, on-chip interconnect, shared L2 cache banks, DRAM memory controller, voltage regulators, DVFS controller) extended with a user-level interrupt network. Steps shown: initiate mug, mug interrupt, thread context swap]
Mug Instruction
- Thread ID to mug
- Address of thread-swapping handler
User-Level Interrupt Network
- Simple, low-bandwidth inter-core network
- Latency on order of 20 cycles
Thread Context Swap
- Threads store architectural state to separate locations in shared memory
- Both threads sync
- Threads load architectural state from other location
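The three-step thread context swap can be sketched with two software threads standing in for the two cores. This is a software analogy only: `swap_contexts` is a hypothetical name, the dictionaries stand in for architectural state, and the barrier plays the role of the "both threads sync" step.

```python
import threading


def swap_contexts(state_a, state_b):
    """Swap two threads' saved state through shared memory (illustrative sketch)."""
    shared = [None, None]           # separate save locations in shared memory
    barrier = threading.Barrier(2)  # the "both threads sync" step
    loaded = [None, None]

    def core(idx, state):
        shared[idx] = dict(state)      # 1. store own architectural state
        barrier.wait()                 # 2. wait until both stores are complete
        loaded[idx] = shared[1 - idx]  # 3. load state from the other location

    threads = [threading.Thread(target=core, args=(i, s))
               for i, s in enumerate((state_a, state_b))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return loaded


big, little = swap_contexts({"tid": 7, "pc": 0x100}, {"tid": 3, "pc": 0x200})
print(big, little)
```

The barrier matters: without it, one side could load the other's save location before the store has happened.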
Cross-Stack Co-Design for Task-Based Parallelism
Let’s see an asymmetry-aware work-stealing runtime in action
Evaluation Methodology: Modeling
Work-Stealing Runtime
- State-of-the-art Intel TBB-inspired work-stealing scheduler
- Chase-Lev task queues with occupancy-based victim selection
- Instrumented with activity hints
Cycle-Level Modeling
- Heterogeneous system modeled in gem5 cycle-approximate simulator
- Support for scaling per-core frequencies + central DVFS Controller
Energy Modeling
- Event-based energy modeling based on detailed RTL/gate-level sims (Synopsys ASIC toolflow, TSMC LP, 65 nm 1.0 V)
- Carefully selected subset of McPAT results tuned to our µarchitecture
Work-Pacing in cilk-sort
[Figure: activity bars (busy vs. steal loop) and DVFS controller decisions over time for the big and little cores, with no AAWS techniques and with work-pacing. Voltage legend: 0.70 V, 0.90 V, 1.00 V, 1.04 V, 1.24 V, 1.30 V. Inset: normalized energy efficiency vs. performance relative to the isopower line]
Speedup of 1.11x, energy efficiency of 1.11x
Work-Sprinting in quicksort
[Figure: activity bars (busy vs. steal loop) and DVFS controller decisions over time for the big and little cores, with no AAWS techniques and with work-sprinting. Voltage legend: 0.70 V, 1.00 V, 1.04 V, 1.10 V, 1.24 V, 1.30 V. Inset: normalized energy efficiency vs. performance relative to the isopower line]
Speedup of 1.34x, energy efficiency of 1.16x
Work-Mugging in radix sort
[Figure: activity bars (busy vs. steal loop) and DVFS controller decisions over time for the big and little cores, with no AAWS techniques and with work-mugging. Voltage legend: 0.70 V, 0.90 V, 1.00 V, 1.04 V, 1.24 V, 1.30 V. Inset: normalized energy efficiency vs. performance relative to the isopower line]
Speedup of 1.17x, energy efficiency of 1.40x
Evaluation of Complete AAWS Runtime
[Figure: normalized energy efficiency vs. performance for each application kernel, relative to the isopower line]
Application Kernels: pbbs-bfs, pbbs-quicksort, pbbs-samplesort, pbbs-dictionary, pbbs-convex-hull, pbbs-radix-sort, pbbs-knn, pbbs-max-independent-set, pbbs-nbody, pbbs-remove-duplicates, pbbs-suffix-array, pbbs-spanning-tree, cilk-cholesky, cilk-cilksort, cilk-heat, cilk-knapsack, cilk-matrix-multiply, parsec-blackscholes, unbalanced-tree-search
Performance: median 1.10x, max 1.32x
Energy Efficiency: median 1.11x, max 1.53x
Popping Back Up a Level
[Figure: the computing stack with emerging applications and emerging technologies, highlighting Task-Based Parallel Runtimes and Integrated Voltage Regulation within computer architecture, with a "1 bit" annotation]
Prototyping to Support Research Results
[Figure: chip plot, 1.25 mm x 1.0 mm: four cores, I$ and D$ tag and data arrays, Bloom filter accelerator, shared MDU, shared FPU, L0, PLL]
Batten Research Group Test Chip 2: Digital Test Chip, TSMC 28 nm, Static Timing Analysis @ 500 MHz
- Completed in 2 months
- Runs a work-stealing runtime (RISC-V XCC)
- Four RISC-V cores + 32kB L1 caches
- Aggressively shared long-latency resources
- Microarchitectural smart sharing mechanisms
- Synthesizable PLL
Results supporting multiple paper submissions to top-tier venues
Section: Voltage Regulation
What is a Voltage Regulator?
[Figure: a voltage regulator (1.8 V, 0.9 V) supplying a chip with four cores (Core 0 to Core 3), four cache banks, and an on-chip interconnect; labels: Chip, Board]
Why is Integrated Voltage Regulation Important?
[Figure: chip with four cores (Core 0 to Core 3), four cache banks, and an on-chip interconnect, comparing off-chip discrete voltage regulators with on-chip integrated voltage regulators]
Key Benefit of IVR
- Reduced System Cost
Why Integrate Now?
- Technology scaling: on-chip switches and capacitors have gotten better
What’s the Problem?
- Integrated voltage regulators are BIG
[Figure: area of a technology-normalized integrated voltage regulator next to a technology-normalized simple RISC core]
What’s taking all that area?
[Figure: IVR test chip, with Vin, load, and capacitor (Cap) regions labeled]
Key Architecture-Level Intuition
[Figure: four cores, each with four capacitor cells (C), in different operating modes: sprint (1.5 V, 20 mA), nominal (1.0 V, 10 mA), and rest]
Key Idea: Dynamic Capacitance Sharing
[Figure: dynamic capacitance sharing (DCS) between Core 0 and Core 1: unit cells, a Vout control loop, a DCS switch fabric, and DCS fabric control; elements labeled A, B, C, D]
Evaluation
[Figure: normalized energy efficiency vs. speedup relative to the isopower line for three configurations: No DVFS; DVFS, no sharing; DVFS, with DCS. Annotation: DCS recovers speedup]
Benchmarks: bfs, bilateral, dither, kmeans, mriq, pbbs-dr, pbbs-knn, pbbs-mm, rsort, splash2-fft, splash2-lu, strsearch, viterbi
SPICE-Level Transient Response
[Figure: two SPICE-level transient response plots of voltage (V) vs. time, one for DVFS with no sharing and one for DVFS with DCS. One plot spans 100 to 400 ns with annotated transitions of 120 ns and 150 ns; the other spans 0.25 to 3.25 us with annotated transitions of 1390 ns, 960 ns, and 2900 ns]
Evaluation
[Figure: normalized energy efficiency vs. speedup relative to the isopower line for No DVFS and for DVFS, with DCS]
10-50% Speedup and 10-70% Energy Efficiency with Area-Optimized On-Chip VRs
Benchmarks: bfs, bilateral, dither, kmeans, mriq, pbbs-dr, pbbs-knn, pbbs-mm, rsort, splash2-fft, splash2-lu, strsearch, viterbi
Popping Back Up a Level
[Figure: the computing stack with emerging applications and emerging technologies, highlighting Simple Parallel Runtimes and Integrated Voltage Regulation within computer architecture, with a "1 bit" annotation]
Dynamic Capacitance Sharing Analog Test Chip
[Figure: chip photo, 2.2 mm x 1.0 mm, with control, loads, and clusters labeled]
Four monolithically integrated switched-capacitor DC-DC converters with the dynamic capacitance sharing technique in 65 nm CMOS
Collaboration with Waclaw Godycki, Ivan Bukreyev, and Professor Alyssa Apsel
[ MICRO 2014 ] [ IEEE TCAS I 2018 ]
Section: Rapid ASIC Design
Challenges in Building ASIC Prototypes
[Figure: ASIC design flow, split into design and simulation and a highly automated standard-cell flow. Stages include high-level design-space exploration, RTL design & simulation, synthesis, post-synthesis gate-level simulation, floorplanning, power routing, placement, clock tree synthesis, routing, post-place-and-route gate-level simulation, power analysis, DRC, RCX, LVS, transistor-level sim, and tape out (2 years later)]
Costly in terms of...
- $$ to fabricate
- $$ for licensing of IP
- $$ for expertise and workforce
Time and effort...
- RTL design and verification
- ASIC frontend / ASIC backend
  - Weeks to months for one iteration
- 1.5 year timeline is too late!
By the end, the frontier of the accelerator design space will have already moved on...
PyMTL ASIC Tapeouts
BRGTC1 in 2016
- RISC processor, 16KB SRAM
- HLS-generated accelerator
- 2x2mm, 1.2M-trans, IBM 130nm
- [ Poster at Hotchips 2016 ]
BRGTC2 in 2018
- Four RISC-V cores with “smart” sharing
- L1$/LLFU, PLL
- 1x1.2mm, ≈10M-trans, TSMC 28nm
- [ RISCV 2018 ]
[Figure: BRGTC2 chip plot, 1.25 mm x 1.0 mm: four cores, I$ and D$ tag and data arrays, Bloom filter accelerator, shared MDU, shared FPU, L0, PLL]
The Celerity SoC
Target Workload: High-Performance Embedded Computing
- Multiple universities under DARPA CRAFT
- 5 × 5mm in TSMC 16 nm FFC
- 385 million transistors
- 511 RISC-V cores
  - 5 Linux-capable Rocket cores
  - 496-core tiled manycore
  - 10-core low-voltage array
- 1 BNN accelerator
- 1 synthesizable PLL
- 1 synthesizable LDO Vreg
- 3 clock domains
- 672-pin flip chip BGA package
- 9 months from PDK access to tape-out
[ Hotchips 2017 ] [ IEEE MICRO 2018 ]
High-Productivity SoC Design Flow based on HLS
[Figure: SoC with a RISCV Rocket core, AXI bus, I/O, SerDes, global memory (control, Bank0 to BankN, crossbar), and a 4 x 4 spatial computation array of PER blocks. Each processing element has control, datapath, scratchpad, and a router interface. Labels: LI Channels, MatchLib]
Internship at NVIDIA Research in Summer’17, led by Brucek Khailany
Lightly involved in their MatchLib project under DARPA CRAFT
[ DAC 2018 ]
Section: Future Research
Future Research
Apply a cross-stack research approach to many different problems using a vertically integrated methodology
- Intelligence on the Edge
- Tiling-Based Designs
- Cyber-Physical Systems
Cross-Stack Co-Design for IoT on the Edge
[Figure: critical care example (managing blood transfusion, crystalloids, vasopressors) under real-time constraints and the HIPAA Privacy Rule. Not enough performance! Specialize]
Software-Hardware Co-Design
- Software design and hardware design topics: quantization, domain hints, security hints, accelerators, compressed comm, secure hardware, fine-grain power
- Intelligence on the edge is required in many different application domains
- We need specialization
- Concretely: Build new accelerator-centric SoCs to enable new applications
- Pipe dreams
  - Emerging applications: smart healthcare + infra
  - Emerging technologies: hybrid CMOS-TFET, emerging memories
- Modular system design
Cross-Stack Co-Design for Tiling Designs
[Figure: designs composed of tiles A, B, and C; labels: Hours, Days, Weeks, NREs, Design]
- We need to build more hardware and make hardware easier to build
- Do our best to avoid monolithic designs
  - GALS – pre-silicon
  - Chiplets – post-silicon
- Concretely: Build new accelerator-centric SoCs with a methodology that extends the tiling abstraction across the stack
Cross-Stack Co-Design for Tiling Designs
[Figure: tiles A, B, C1, and C2 with a constant "0" input, shown three ways: constant propagation, constant propagation (modular design), and an RTL encoding of the tile abstraction (TileInterface) with generated constraints]
Cross-Stack Co-Design with Cyber-Physical Systems
- Concretely: Explore new SoCs that can be embedded into cyber-physical systems
- Pipe dreams
  - Inspired by projects like Harvard RoboBee
  - Architectures + cyber-physical systems where custom acceleration and silicon prototyping can make a real difference
Acknowledgements and Funding
- Batten Research Group: Derek Lockhart, Ji Kim, Shreesha Srinath, Berkin Ilbeyi, Moyang Wang, Shunning Jiang, Khalid Al-Hawaj, Tuan Ta, Lin Cheng, Yanghui Ou, Peitian Pan, Christopher Batten
- Apsel Research Group: Waclaw Godycki, Ivan Bukreyev, Alyssa Apsel
- UCSD / University of Washington: Scott Davidson, Paul Gao, Atieh Lotfi, Julian Puscar, Loai Salem, Anuj Rao, Ningxiao Sun, Luis Vega, Bandhav Veluri, Xiaoyang Wang, Shaolin Xie, Chun Zhao, Michael B. Taylor
- University of Michigan: Tutu Ajayi, Aporva Amarnath, Austin Rovinski, Ronald G. Dreslinski
- NVIDIA: Brucek Khailany, Evgeni Krimer, Rangharajan Venkatesan, Jason Clemons, Joel Emer, Matthew Fojtik, Alicia Klinefelter, Michael Pellauer, Nathaniel Pinckney, Yakun Sophia Shao, Shreesha Srinath, Sam (Likun) Xi, Yanqing Zhang, Brian Zimmer
- Celerity: Tutu Ajayi, Khalid Al-Hawaj, Aporva Amarnath, Steve Dai, Scott Davidson, Paul Gao, Gai Liu, Atieh Lotfi, Julian Puscar, Anuj Rao, Austin Rovinski, Loai Salem, Ningxiao Sun, Luis Vega, Bandhav Veluri, Xiaoyang Wang, Shaolin Xie, Chun Zhao, Ritchie Zhao, Christopher Batten, Ronald G. Dreslinski, Ian Galton, Rajesh K. Gupta, Patrick P. Mercier, Mani Srivastava, Michael B. Taylor, Zhiru Zhang
- BRGTC1/2: Shunning Jiang, Khalid Al-Hawaj, Ivan Bukreyev, Berkin Ilbeyi, Tuan Ta, Lin Cheng, Julian Puscar, Ian Galton, Moyang Wang, Bharath Sudheendra, Nagaraj Murali, Suren Jayasuriya, Shreesha Srinath, Taylor Pritchard, Robin Ying, Christopher Batten
Takeaway Points
Emerging new contexts demand much higher performance and energy efficiency
Cross-stack co-design can make a real impact
- More efficient task-based parallel runtimes
- Smaller area for integrated voltage regulation
- Better methodologies for rapid ASIC design
I am excited to explore future cross-stack research applied to intelligence on the edge, driving methodologies for tiling-based designs, and also supporting cyber-physical systems.