Approximate Computing and Storage from Programming Language to Hardware and Molecules
Luis Ceze
Paul G. Allen School of Computer Science & Engineering, University of Washington
Moore’s law gives us lots of transistors…
But it is Dennard scaling that lets us use them: 2x transistor count, 40% faster, 50% more efficient…
(scan: Dennard et al., IEEE Journal of Solid-State Circuits, Vol. SC-9, No. 5, October 1974; plots: # of transistors and power across generations, power density)
“Dark silicon”
(boo!)
Can’t have all transistors on!
Explicit parallelism is important, but not a longer term solution…
With Dennard scaling dead (~2005-8), power per transistor stays constant…
“The number of people predicting the end of Moore’s Law doubles every year.” – Doug Carmean (Intel->MSFT)
And… The Economist says…
So how do we make computer systems better?
It is unavoidable that we need to exploit application properties and specialize.
Source: Bob Brodersen, Berkeley Wireless group
YouTube: 100 hours of video uploaded per minute; billions of views a day.
One trillion photos were taken in 2015, compared to 80 billion in 2000: a 12.5x increase over 15 years.
These applications consume most of our cycles/bytes/bandwidth.
Input data is inexact by nature (sensors).
They have multiple acceptable outputs.
What is “approximate computing”?
Exploit inherent application-level resilience/redundancies to build more efficient and better computers: trade output accuracy for efficiency and performance.
Physics Circuits ISA/Architecture Compiler Language Application
In essence, the goal is to specialize computation, storage, and communication to the properties of the data and the algorithm. This squeezes out excess precision and enables better use of the underlying substrate.
Wait, what about… :)
Applications/Algorithms and potential gains:
Floating point… ~4X
Machine learning ~5X
Iterative algorithms ~10X
Lossy compression ~10-100X
Language: reasoning about approximation in PL
Compiler: approximate compiler optimizations
ISA/Architecture: approximate execution models
Circuits: non-deterministic "unsafe" circuits (voltage scaling, memory)
Physics: analog hardware (closer to physics)
(each layer trades energy for errors)
Ground-Truth Quality from Crowdsourced Opinions
(plot: acceptability, 0%-100%, vs. energy in abstract units, 100-500, comparing "Without Trade-Off :(" and "With Trade-Off :)")
But approximation needs to be done carefully... or...
HW/SW co-design is essential…
Approximation just at the hardware level isn’t safe. Approximation just at the algorithm level is suboptimal. Assuming reliable hardware for inherently robust algorithms is a waste.
What and how to approximate? How good is my output? How to take advantage of it?
Hardware Runtime Compiler Language
Three key questions
“Disciplined” approximate programming
Precise: references, jump targets, JPEG header
Approximate: pixel data, neuron weights, audio samples, video frames
EnerJ/C++
Separate critical and non-critical program components. Analyzable statically.
@Approx int a = ...;
@Precise int p = ...;
p = a;  ✗  (approximate data cannot flow into precise state)
a = p;  ✓
[PLDI’11, OOPSLA’15]
@Approx<0.9> int a = ...;
@Precise int p = ...;
p + p;  p + a;  a + a;
if (a > 10) { p = 2; }            ✗  (approximate value controls a precise update)
if (endorse(a) > 10) { p = 2; }   ✓  (explicit endorsement)
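To make the discipline concrete, here is a toy Python sketch of the same rules enforced dynamically (the Approx wrapper and endorse helper are hypothetical; EnerJ enforces all of this statically, in Java's type system):

class Approx:
    """Toy taint wrapper: marks a value as approximate."""
    def __init__(self, value):
        self.value = value
    def __add__(self, other):
        # any expression touching an approximate operand stays approximate
        o = other.value if isinstance(other, Approx) else other
        return Approx(self.value + o)
    __radd__ = __add__

def endorse(x):
    """Explicit, programmer-audited cast from approximate to precise."""
    return x.value if isinstance(x, Approx) else x

a = Approx(7)        # @Approx int a = 7;
p = 5                # @Precise int p = 5;

r = a + a            # fine: result remains Approx
# p = a              # EnerJ's type checker rejects this assignment
# if a > 10: ...     # and rejects approximate values guarding precise updates
if endorse(a) > 10:  # legal once explicitly endorsed
    p = 2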
How good is my output?
– e.g., % of bad pixels, deviation from expected value, % of poorly classified images, car crashes, etc…
Specifying and checking quality
res = computeSomething();
assert diff(res, resʹ) < 0.1;   // resʹ: precise version of the result
passert expr, prob, conf
Verifying quality expressions
program + input and error distributions → Bayesian network IR → exact evaluation where possible, Bayes-net sampling otherwise
[PLDI’14]
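As a rough illustration of what checking a passert by sampling could look like, here is a hedged Python sketch (passert_check is a hypothetical helper; the actual PLDI'14 system compiles to the Bayesian-network IR and uses exact evaluation where it can):

import math
import random

def passert_check(sample_once, prob, conf, n=10_000):
    """Monte Carlo check of `passert expr, prob, conf`: estimate Pr[expr]
    from n samples and compare a one-sided lower confidence bound
    against the required probability."""
    hits = sum(sample_once() for _ in range(n))
    p_hat = hits / n
    z = 1.645 if conf <= 0.95 else 2.326          # crude z-score lookup
    lower = p_hat - z * math.sqrt(p_hat * (1 - p_hat) / n)
    return lower >= prob

def one_run():
    # approximate mean: sample half the data, compare to the precise mean
    xs = [random.gauss(0.0, 1.0) for _ in range(100)]
    res_precise = sum(xs) / len(xs)
    res = sum(xs[::2]) / len(xs[::2])
    return abs(res - res_precise) < 0.1           # the quality expression

print(passert_check(one_run, prob=0.9, conf=0.95))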
int p = 5;
@Approx int a = 7;
for (int x = 0..) {
  a += func(2);
  @Approx int z;
  z = p * 2;
  p += 4;
}
a /= 9;
p += 10;
socket.send(z);
write(file, z);
Online quality monitoring
Can react: recompute or reduce approximation. But it needs to be cheap!
Sampled precise re-execution
Simple verification functions (is the result within ε?)
Fuzzy memoization
[ASPLOS’15]
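A minimal Python sketch of two of these monitors, assuming scalar outputs and a tolerance ε (names are hypothetical, not from the ASPLOS'15 system):

import random

_cache = []   # list of (input, output) pairs

def fuzzy_memoize(f_approx, x, tol=1e-3):
    """Fuzzy memoization: reuse the result computed for a nearby input."""
    for seen_x, seen_y in _cache:
        if abs(seen_x - x) < tol:
            return seen_y
    y = f_approx(x)
    _cache.append((x, y))
    return y

def monitored(f_approx, f_precise, x, eps=0.1, rate=0.01):
    """Sampled precise re-execution: occasionally run the precise version;
    if |approx - precise| >= eps, react by returning the precise result
    (a real system might also dial the approximation down)."""
    y = f_approx(x)
    if random.random() < rate:
        y_ref = f_precise(x)
        if abs(y - y_ref) >= eps:
            return y_ref
    return y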
How to take advantage of approximation?
CPU
Approximate functional units, data path, registers, caches.
Approximate on-chip interconnect?
Mixed-mode functional units?
Approximate parallelization?
Approximate accelerators.
Acc CPU
Hardware
Dual-Voltage Core
VDDH VDDL
Dual-voltage approximate CPU
[ASPLOS 2012]
(pipeline diagram: Fetch, Decode, Reg Read, Execute, Memory, WB; instruction cache, ITLB, decoder, register files, integer and FP FUs, data cache, DTLB; with replicated functional units and dual-voltage SRAM arrays)
Compiler
ld    0x04 r1
ld    0x08 r2
add.a r1 r2 r3
st.a  0x0c r3
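The .a suffix marks instructions the compiler may steer onto the low-voltage (VDDL) path. A hedged sketch of how a simulator might model such an instruction (the per-bit flip error model is an assumption for illustration, not the paper's measured behavior):

import random

def add_a(r1, r2, p_flip=1e-4, width=32):
    """Model of `add.a`: a 32-bit add executed at VDDL,
    where each result bit flips with a small probability."""
    result = (r1 + r2) & ((1 << width) - 1)
    for bit in range(width):
        if random.random() < p_flip:
            result ^= 1 << bit          # injected timing error
    return result

r1, r2 = 0x12345678, 0x00001111
r3 = add_a(r1, r2)                      # add.a r1 r2 r3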
7–24% energy saved on average (fft, game engines, raytracing, QR code readers, etc.; scope: processor + memory)
not good... :(
(though better implementations likely)
Amdahl’s law... damn!
(pipeline diagram: Fetch, Decode, Reg Read, Execute, Memory, Write Back; branch predictor, caches, TLBs, decoder, register files, integer and FP FUs)
CPU
Very efficient hardware implementations! Trainable to mimic many computations! Recall is imprecise.
Neural Networks as Approximate Accelerators
Fault tolerant
[Temam, ISCA 2012][MICRO 2012, ISCA 2014, HPCA’15, DATE’18 ]
Neural acceleration:
1. Find an approximate program component
2. Compile the program and train a neural network
3. Execute on a fast Neural Processing Unit (NPU)
An example: Sobel filter
@approx float grad(approx float[3][3] p) { … }
edgeDetection()
void edgeDetection(aImage &src, aImage &dst) {
  for (int y = …) {
    for (int x = …) {
      dst[x][y] = grad(window(src, x, y));
    }
  }
}
@approx float dst[][];
Approximable ✓
Well-defined inputs and outputs ✓
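For a sense of the offline training step, here is a hedged sketch using numpy and scikit-learn as stand-ins for the NPU toolchain (the 3x3 Sobel grad below is a standard formulation, and the MLP topology is illustrative):

import numpy as np
from sklearn.neural_network import MLPRegressor  # assumes scikit-learn is installed

KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

def grad(p):
    """Precise Sobel gradient magnitude for a 3x3 window (the target region)."""
    gx = float((KX * p).sum())
    gy = float((KX.T * p).sum())
    return (gx * gx + gy * gy) ** 0.5

# 1. Observe the precise function on many inputs (here: random windows)
X = np.random.rand(20_000, 9)
y = np.array([grad(x.reshape(3, 3)) for x in X])

# 2. Train a small MLP to mimic it (stand-in for offline NPU training)
mimic = MLPRegressor(hidden_layer_sizes=(8, 4), max_iter=500).fit(X, y)

# 3. Invoke the mimic instead of the original code at runtime
w = np.random.rand(3, 3)
print(grad(w), float(mimic.predict(w.reshape(1, 9))[0]))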
Neural Processing Unit
(Core ↔ NPU interface: inputs and configuration move through enqueue/dequeue queues: enq.c/deq.c for configuration, enq.d/deq.d for data)
A canonical Neural Processing Unit
(canonical NPU internals: a bus scheduler routes inputs among processing engines according to a static schedule)
Summary of NPU results
CPU + NPU (digital): 0.8x - 11.1x (3x mean) speedup, 1.1x - 21x (3x mean) energy reduction
CPU + NPU (FPGA): 1.3x - 38x (3.8x mean) speedup, 0.9x - 28x (2.8x mean) energy reduction
CPU + NPU (analog): 0.9x - 24x (3.7x mean) speedup, 1.5x - 51x (6.8x mean) energy reduction
(analog NPU: DACs drive inputs x_0…x_n as currents I(x_i) through resistances R(w_i); an ADC reads y ≈ sigmoid(Σ_i I(x_i)·R(w_i)))
application | domain | error metric
blackscholes | option pricing | MSE
fft | DSP | MSE
inversek2j | robotics | MSE
jmeint | 3D modeling | miss rate
jpeg | compression | image diff
kmeans | ML | image diff
sobel | vision | image diff
(diagram: NPU/AFU coupled to an OpenMSP430 core: IMEM, shared DMEM, FIFOs, memory arbiter, GPIO & serial, SRAM banks; the AFU has a LUT, accumulators, and a systolic array of 8 PEs, each with weight SRAM and a MAC unit)

Approximate Program Synthesis
precise implementation + desired quality → approximate program
float dist_approx(int a[3], int b[3]) {
  int c1 = abs(b[0] - a[0]);
  int c2 = abs(b[1] - a[1]);
  int c3 = abs(a[2] - b[2]);
  int c4 = c1 | c2;
  int c5 = abs(c3 > c4 ? c3 : c4);
  return (float)c5;
}
3D Euclidean distance: 1.6× faster, 14.9% error
Automatically discover the most efficient acceptable program [POPL’16]
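A quick way to sanity-check a synthesized candidate like dist_approx is to port both versions and measure error on sampled inputs; a Python sketch (the measured error depends on the input distribution, so it will not necessarily match the 14.9% above):

import math
import random

def dist_precise(a, b):
    return math.sqrt(sum((bi - ai) ** 2 for ai, bi in zip(a, b)))

def dist_approx(a, b):
    # direct port of the synthesized C code above
    c1 = abs(b[0] - a[0])
    c2 = abs(b[1] - a[1])
    c3 = abs(a[2] - b[2])
    c4 = c1 | c2                       # bitwise OR stands in for costlier math
    c5 = abs(c3 if c3 > c4 else c4)
    return float(c5)

errs = []
for _ in range(10_000):
    a = [random.randrange(256) for _ in range(3)]
    b = [random.randrange(256) for _ in range(3)]
    d = dist_precise(a, b)
    if d > 0:
        errs.append(abs(dist_approx(a, b) - d) / d)
print(f"mean relative error: {sum(errs) / len(errs):.1%}")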
ACCEPT, a framework for approximate programming
Annotated program + program inputs & quality metrics → ACCEPT static analysis → instruction-level precision configuration (ILPC) → ACCEPT error injection & instrumentation → approximate binary → execution & quality assessment → quality autotuner → output configuration, quality results & bit savings
Deep Learning Compute Requirements Are Steadily Growing
Source: Eugenio Culurciello, An Analysis of Deep Neural Network Models for Practical Applications, arXiv:1605.07678 (plot: accuracy vs. parameter size)
Neural Networks and Quantization
Specialized approximation: understand how parameters are generated/trained, and adjust them to compensate for approximations.
Application + original parameters → approximate application + adjusted parameters
Accuracy of an MLP (784-128-64-10) trained on MNIST
(plot: validation accuracy vs. weight precision, 8 bits down to 1; precise baseline 96.78%) Quantize-during-training shows much more graceful quality degradation than quantize-after-training.
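The quantize-during-training curve comes from letting the forward pass see quantized weights while updates flow to a full-precision copy (a straight-through estimator). A minimal numpy sketch of that idea on a single linear layer (toy setup, not the 784-128-64-10 MNIST model above):

import numpy as np

def quantize(w, bits):
    """Uniform symmetric quantization to the given bit width."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 10))       # full-precision master weights
X = rng.normal(size=(32, 64))       # one batch of inputs
T = rng.normal(size=(32, 10))       # targets (stand-in for real labels)

for step in range(200):
    Wq = quantize(W, bits=3)        # forward pass sees 3-bit weights
    out = X @ Wq
    g = X.T @ (out - T) / len(X)    # gradient of 0.5*||X Wq - T||^2 wrt Wq
    W -= 0.05 * g                   # straight-through: update the FP copy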
Pareto-Optimization Challenge
Accuracy vs. throughput
NN model & topology
Parameter compression
Operators
Compiler optimization
Architectural exploration
Search space grows exponentially!
Meta-Learning Preliminary Study
Proof-of-concept study: Explore impact of topology exploration, parameter compression and architectural exploration on accuracy vs. performance (MNIST + MLP)
(plot: validation accuracy, 0.87-0.99, vs. inference time, 0.5-2 ms, across 2-bit through 8-bit weights; the knee of the Pareto frontier is marked)
NN model & topology, parameter compression, architectural exploration
Kim et al. MATIC [DATE’18 Best Paper]
(diagram repeated: NPU/AFU with OpenMSP430 core, systolic PE array with per-PE weight SRAM and MAC units)
Tape-out of the SNNAP + MSP430 design
Use SRAM voltage scaling as a knob to reduce power
Rely on memory-adaptive training to gracefully degrade application error
(plot: normalized error vs. voltage)
Automatic Generation of HW-SW Stack for Deep Learning
Addressing the Programmability Challenge with TVM
High-Level Deep Learning Framework
NNVM Graph
TVM Compiler
RPC Layer
FPGA/SoC Drivers
Runtime & JIT Compiler
VTA FPGA Design
TVM DSL allows for separation of schedule and algorithm
tvmlang.org
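A small example of that separation, in the style of TVM's tensor-expression DSL (API names as of TVM ~0.8's tvm.te module; check the docs at tvmlang.org for current names):

import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")

# Algorithm: declares *what* C is
C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")

# Schedule: declares *how* to compute it; changing it cannot change the result
s = te.create_schedule(C.op)
outer, inner = s[C].split(C.op.axis[0], factor=64)
s[C].vectorize(inner)

f = tvm.build(s, [A, B, C], target="llvm")   # compile for CPU

The compute expression fixes the result; split/vectorize change only how it is produced, which is exactly the knob an autotuner can search.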
Another important trend…
(plot: data generated vs. installed storage capacity, in exabytes, 2005-2019; data: IDC)
Mostly videos and images
Fast. Dense. Accurate.
Precise 2-bit cell: 4 levels (00, 01, 10, 11), each a narrow, well-separated probability distribution.
Approximate 3-bit cell: 8 levels (000-111) packed into the same range; neighboring distributions overlap, so reads occasionally return an adjacent level.
[MICRO’13, ASPLOS’16, ASPLOS’17]
2x gain in density and performance (JPEG-XR, H.264)
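A hedged simulation sketch of the underlying trade-off: packing more levels into a cell raises density but lets read noise cross level boundaries (the Gaussian noise model and sigma are assumptions for illustration):

import random

def store_and_read(bits_per_cell, value, sigma=0.03):
    """Write `value` as one analog level in a cell, read it back
    under Gaussian level noise, and re-quantize to the nearest level."""
    levels = 2 ** bits_per_cell
    target = value / (levels - 1)                # normalized analog level
    sensed = target + random.gauss(0.0, sigma)   # write/read noise
    return min(levels - 1, max(0, round(sensed * (levels - 1))))

trials = 100_000
for bits in (2, 3):   # precise 2-bit cell vs. denser approximate 3-bit cell
    errors = 0
    for _ in range(trials):
        v = random.randrange(2 ** bits)
        errors += store_and_read(bits, v) != v
    print(f"{bits}-bit cell: raw error rate {errors / trials:.2%}")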
Approximate Storage
Molecular data storage: a few atoms per bit. All major storage technologies are approaching their limits: magnetic disks (SMR, HAMR, bit patterning: 3-4 scaling steps more?), flash (going vertical, but to what limit?), optical (limited by the wavelength of light).
10100011100 100011110011 11100010110 01010010111 101… → manufactured DNA
Using Synthetic DNA for Data Storage
Photo: Tara Brown / UW
~10+ TB
Extremely dense: 1 exabyte in 1 in³
Extremely durable: 100,000s of years
Never gets obsolete
DNA Molecules for Digital Data
Making Copies Is Nearly Free
Write path: (1) encoding: 1 1 0 1 1 1 0 1 → (2) synthesis: A G C T A T C A G
Read path: (1) sequencing: A G C T A T C A G → (2) decoding: 1 1 0 1 1 1 0 1
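In the simplest (illustrative) mapping, each nucleotide carries two bits; a Python sketch of the encode/decode endpoints (real systems add error-correcting codes and avoid problematic sequences such as long homopolymers):

BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BITS = {v: k for k, v in BASE.items()}

def encode(bits):
    """Write-path encoding: 2 bits per nucleotide."""
    assert len(bits) % 2 == 0
    return "".join(BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand):
    """Read-path decoding: nucleotides back to bits."""
    return "".join(BITS[base] for base in strand)

data = "11011101"           # the bit string from the slide
strand = encode(data)       # -> "TCTC"
assert decode(strand) == data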
Molecular domain Electronic domain
Hardware, Software, Wetware
Electric signals vs. molecular signals and processes (proteins, ions); physical circuit plasticity. Approximation and moving parts are an inherent advantage of natural computing!
Computer architecture, coding theory, molecular biology, fluidics, automation
Computer architecture, programming languages, operating systems, and machine learning
Luis Ceze University of Washington
luisceze@cs.washington.edu
3D-Printed WiFi
UW Reality Lab for AR/VR Research
Battery-Free Cellphone
Mobile Health
DNA Data Storage
MERGE Algorithm for Cancer Treatment
A Center of Computing Research
A Center of Biomedical Research…
…and Global Health & Development
Award-winning Contributions
Best Papers at top research conferences:
Institution | Score
Microsoft Research | 46.0
University of Washington | 38.1
MIT | 34.3
CMU | 34.2
Stanford | 32.8
UC Berkeley | 25.2
University of Toronto | 14.3
Cornell University | 14.0
University of Illinois | 13.7
IBM Research | 12.8
Source: jeffhuang.com