Approximate Computing and Storage from Programming Language to Hardware and Molecules

Luis Ceze, Paul G. Allen School of Computer Science & Engineering, University of Washington


SLIDE 1

Paul G. Allen School of Computer Science & Engineering University of Washington

Luis Ceze

Approximate Computing and Storage from Programming Language to Hardware and Molecules

SLIDE 2

Moore’s law gives us lots of transistors on a chip.

But it is Dennard scaling that lets us use them: 2x transistor count, 40% faster, 50% more efficient…

[Image: scan of R. H. Dennard, F. H. Gaensslen, H.-N. Yu, V. L. Rideout, E. Bassous, and A. R. LeBlanc, “Design of Ion-Implanted MOSFET’s with Very Small Physical Dimensions,” IEEE Journal of Solid-State Circuits, vol. SC-9, no. 5, Oct. 1974]
SLIDE 3

[Chart: transistor count keeps growing across generations, but so does power density: “dark silicon” (boo!)]

Can’t have all transistors on!

Explicit parallelism is important, but not a longer term solution…

With Dennard scaling dead (~2005-8), power per transistor stays constant…

SLIDE 4

“The number of people predicting the end of Moore’s Law doubles every year.” – Doug Carmean (Intel->MSFT)

And… The Economist says…

SLIDE 5

So how do we make computer systems better?

It is unavoidable that we need to exploit application properties and specialize.

Source: Bob Broderson, Berkeley Wireless group

SLIDE 6

YouTube: 100 hours of video uploaded per minute. 4 billion video views a day.

One trillion photos were taken in 2015, compared to 80 billion in 2000: a 12.5x increase over 15 years.

These applications consume most of our cycles/bytes/bandwidth. Input data is inexact by nature (sensors). They have multiple acceptable outputs.

SLIDE 7

What is “approximate computing”?


Exploit inherent application-level resilience/redundancies to build more efficient/better computers: trade output accuracy for efficiency and performance.

Application → Language → Compiler → ISA/Architecture → Circuits → Physics

In essence, goal is to specialize computation, storage and communication to properties of the data and the algorithm. Squeezes excess precision and enables better use of underlying substrate.

SLIDE 8

Wait, what about… :)

Applications/Algorithms: floating point (~4X), machine learning (~5X), iterative algorithms (~10X), lossy compression (~10-100X)

Across the stack:

  • Language: reasoning about approximation in PL
  • Compiler: approximate compiler optimizations
  • ISA/Architecture: approximate execution models
  • Circuits: non-deterministic “unsafe” circuits (voltage scaling, mem)
  • Physics: analog hardware (closer to physics)

SLIDE 9

SLIDE 10

[Charts: energy vs. errors trade-offs]

SLIDE 11

[Chart: acceptability (0-100%, ground-truth quality from crowdsourced opinions) vs. energy (abstract units, 100-500): without trade-off :( vs. with trade-off :)]

SLIDE 12

But approximation needs to be done carefully... or...

SLIDE 13

SLIDE 14

SLIDE 15

HW/SW co-design is essential…

Approximation just at the hardware level isn’t safe. Approximation just at the algorithm level is suboptimal. Assuming reliable hardware for inherently robust algorithms is a waste.

SLIDE 16

Three key questions:

1. What and how to approximate?
2. How good is my output?
3. How to take advantage of it?

Addressed across the stack: language, compiler, runtime, hardware.

SLIDE 17

“Disciplined” approximate programming

Precise (✗ do not approximate): references, jump targets, JPEG header

Approximate (✓ OK to approximate): pixel data, neuron weights, audio samples, video frames

  • Programmer has direct control of what is approximate/precise and of the flow between them
  • System is free to approximate as long as the rules are obeyed
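The discipline above can be sketched in a few lines of Python (this is not the EnerJ implementation; `Approx`, `endorse`, and `store_precise` are illustrative names): approximate values carry a tag, and crossing into precise state requires an explicit endorsement.

```python
# Illustrative sketch of disciplined approximate programming.
# All names here are hypothetical, not from the actual system.

class Approx:
    """A value the system is allowed to approximate."""
    def __init__(self, value):
        self.value = value

def endorse(x):
    """Programmer explicitly accepts an approximate value as precise."""
    return x.value if isinstance(x, Approx) else x

def store_precise(x):
    """Precise state rejects raw approximate values."""
    if isinstance(x, Approx):
        raise TypeError("approximate value cannot flow into precise state")
    return x

pixel = Approx(127)            # pixel data: OK to approximate
store_precise(endorse(pixel))  # allowed only because the flow is explicit
```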
SLIDE 18

EnerJ/C++

Separate critical and non-critical program components. Analyzable statically.

A precise value may flow into an approximate one, but not the reverse:

@Approx int a = ...;
@Precise int p = ...;
p = a;   ✗
a = p;   ✓

[PLDI’11, OOPSLA’15]

  • Operator overloading for approximate operations:

    @Approx<0.9> int a = ...;
    @Precise int p = ...;
    p + p;   precise
    p + a;   approximate
    a + a;   approximate

  • Endorsement of approximate values:

    p = endorse(a);

  • Dealing with implicit flows in control:

    if ( endorse(a) > 10 ) { p = 2; }

SLIDE 19

How good is my output?

  • Quality-of-Result (QoR)
  • Application dependent

– e.g., % of bad pixels, deviation from expected value, % of poorly classified images, car crashes, etc…
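A “% of bad pixels” metric, for instance, can be just a few lines (illustrative sketch; the tolerance and names are ours, not from any specific system):

```python
# A pixel is "bad" if it deviates from the precise output by more than a
# tolerance; the QoR score is the percentage of bad pixels.

def percent_bad_pixels(precise, approx, tol=10):
    bad = sum(1 for p, a in zip(precise, approx) if abs(p - a) > tol)
    return 100.0 * bad / len(precise)

precise = [100, 200, 50, 75]
approx  = [102, 180, 50, 76]   # one pixel off by 20
assert percent_bad_pixels(precise, approx) == 25.0
```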

SLIDE 20

Specifying and checking quality

res = computeSomething();
assert diff(res, resʹ) < 0.1;   // resʹ: the precise version of the result

passert expr, prob, conf
(a probabilistic assertion: expr must hold with at least probability prob, checked at confidence conf)
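One brute-force way to read a `passert` is by sampling. The PLDI’14 system instead compiles the program to a Bayesian-network IR, but the sampling sketch below (hypothetical names, `conf` omitted for brevity) conveys the semantics:

```python
# Check "expr holds with probability >= prob" by Monte-Carlo sampling.
# Purely illustrative; not the actual verification pipeline.
import random

def passert_check(expr, prob, n_samples=10000, seed=0):
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if expr(rng))
    return hits / n_samples >= prob

# "a noisy adder stays within 0.5 of the true sum, with prob >= 0.9"
noisy_sum_ok = lambda rng: abs((2 + 3 + rng.gauss(0, 0.2)) - 5) < 0.5
assert passert_check(noisy_sum_ok, prob=0.9)
```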

SLIDE 21

Verifying quality expressions

Quality expression + input and error distributions → Bayesian network IR → optimized Bayes net → sampling or exact evaluation

[PLDI’14]

int p = 5;
@Approx int a = 7;
for (int x = 0..) {
  a += func(2);
  @Approx int z;
  z = p * 2;
  p += 4;
}
a /= 9;
p += 10;
socket.send(z);
write(file, z);

SLIDE 22

Online quality monitoring

Can react: recompute or reduce approximation. But it needs to be cheap!

  • Sampled precise re-execution (< ε?)
  • Simple verification functions (< ε?)
  • Fuzzy memoization (< ε?)

[ASPLOS’15]
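Sampled precise re-execution can be sketched as a wrapper that occasionally runs the precise function and reacts when the difference exceeds ε (illustrative names, not the ASPLOS’15 implementation):

```python
# Cheap online monitoring: check only 1-in-N calls against the precise
# version; on a violation, fall back to the precise result.

def monitored(approx_fn, precise_fn, eps, sample_every=10):
    stats = {"calls": 0, "violations": 0}
    def wrapper(x):
        out = approx_fn(x)
        stats["calls"] += 1
        if stats["calls"] % sample_every == 0:   # sampled check
            precise = precise_fn(x)
            if abs(out - precise) > eps:
                stats["violations"] += 1
                out = precise                    # react: recompute precisely
        return out
    wrapper.stats = stats
    return wrapper

f = monitored(lambda x: x * x + 0.01, lambda x: x * x, eps=0.1)
results = [f(i) for i in range(100)]
assert f.stats["violations"] == 0                # 0.01 error is within eps
```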
SLIDE 23

How to take advantage of approximation?

CPU

Approximate functional units, data path, registers, caches. Approximate on-chip interconnect? Mixed-mode functional units? Approximate parallelization? Approximate accelerators.

Acc CPU

Hardware

SLIDE 24

Dual- Voltage Core

VDDH VDDL

Dual-voltage approximate CPU

[ASPLOS 2012]

[Pipeline diagram: Fetch, Decode, Reg Read, Execute, Memory, WB; branch predictor, instruction cache, ITLB, decoder, register files, integer/FP FUs, data cache, DTLB; with replicated functional units and dual-voltage SRAM arrays]

SLIDE 25

Dual- Voltage Core

VDDH VDDL

Dual-voltage approximate CPU

[ASPLOS 2012]

Compiler

ld 0x04 r1
ld 0x08 r2
add.a r1 r2 r3
st.a 0x0c r3

7–24% energy saved on average

(fft, game engines, raytracing, QR code readers, etc) (scope: processor + memory)

not good... :(

(though better implementations likely)

SLIDE 26

Amdahl’s law... damn!

[Pipeline diagram: Fetch, Decode, Reg Read, Execute, Memory, Write Back]

  • Benefit limited to what can be approximated
  • Instruction control cannot be approximated
SLIDE 27

CPU

Very efficient hardware implementations! Trainable to mimic many computations! Recall is imprecise.

Neural Networks as Approximate Accelerators

Fault tolerant

[Temam, ISCA 2012]

[MICRO 2012, ISCA 2014, HPCA’15, DATE’18 ]

SLIDE 28

Program

Neural acceleration

SLIDE 29

Neural acceleration

Program

Find an approximate program component

SLIDE 30

Program

Compile the program and train a neural network

Neural acceleration

Find an approximate program component

SLIDE 31

Program

Compile the program and train a neural network Execute on a fast Neural Processing Unit (NPU)

Neural acceleration

Find an approximate program component
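The workflow above in miniature (an illustrative sketch: a real NPU trains a neural network on the collected samples; here a tiny least-squares quadratic fit stands in as the “trained model” so the example stays dependency-free, and all names are ours):

```python
def hot_region(x):                     # the approximable program component
    return 0.5 * x * x + 2.0 * x

# 1) profile: collect input/output samples from the region
xs = [i / 10 for i in range(-20, 21)]
ys = [hot_region(x) for x in xs]

# 2) "train": least squares for y ~ c2*x^2 + c1*x + c0 via normal equations
def fit_quadratic(xs, ys):
    A = [[sum(x ** (i + j) for x in xs) for j in range(3)] for i in range(3)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(3)]
    for col in range(3):               # Gaussian elimination
        piv = A[col][col]
        for r in range(col + 1, 3):
            f = A[r][col] / piv
            A[r] = [a - f * c for a, c in zip(A[r], A[col])]
            b[r] -= f * b[col]
    c = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                # back substitution
        c[i] = (b[i] - sum(A[i][j] * c[j] for j in range(i + 1, 3))) / A[i][i]
    return c                           # [c0, c1, c2]

c0, c1, c2 = fit_quadratic(xs, ys)

# 3) substitute: the surrogate replaces the original region
surrogate = lambda x: c2 * x * x + c1 * x + c0
assert all(abs(surrogate(x) - hot_region(x)) < 1e-6 for x in xs)
```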

SLIDE 32

An example: Sobel filter

@approx float dst[][];

@approx float grad(approx float[3][3] p) { … }

void edgeDetection(aImage &src, aImage &dst) {
  for (int y = …) {
    for (int x = …) {
      dst[x][y] = grad(window(src, x, y));
    }
  }
}

Approximable

Well-defined inputs and outputs
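For reference, the gradient of a 3x3 window under the standard Sobel kernels looks like this in plain Python (a sketch; the helper name and test images are ours):

```python
# Sobel gradient magnitude over a 3x3 window: the kind of pure,
# well-delimited function the slide marks as approximable.

def grad(p):
    """p: 3x3 list of pixel intensities -> gradient magnitude."""
    gx = (p[0][2] + 2 * p[1][2] + p[2][2]) - (p[0][0] + 2 * p[1][0] + p[2][0])
    gy = (p[2][0] + 2 * p[2][1] + p[2][2]) - (p[0][0] + 2 * p[0][1] + p[0][2])
    return (gx * gx + gy * gy) ** 0.5

flat = [[10, 10, 10]] * 3                        # uniform region: no edge
edge = [[0, 0, 0], [0, 0, 0], [255, 255, 255]]   # horizontal edge
assert grad(flat) == 0.0
assert grad(edge) > 0.0
```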

SLIDE 33

Neural Processing Unit

Core NPU

Interface: input, output, configuration; queues: enq.c, deq.c, enq.d, deq.d

SLIDE 34

A canonical Neural Processing Unit

Bus Scheduler

Processing Engines

input, output, scheduling

SLIDE 35

Summary of NPU results

CPU F D X I M C

NPU

0.8x - 11.1x (3x mean) speedup 1.1x - 21x (3x mean) energy red.

CPU

FPGA

1.3x - 38x (3.8x mean) speedup 0.9x - 28x (2.8x mean) energy red.

[Analog NPU: DACs drive inputs I(x0)…I(xn) across resistive weights R(w0)…R(wn); an ADC digitizes the accumulated sum, giving y ≈ sigmoid( Σ I(xi)·R(wi) )]

0.9x - 24x (3.7x mean) speedup 1.5x - 51x (6.8x mean) energy red.

application     domain          error metric
blackscholes    option pricing  MSE
fft             DSP             MSE
inversek2j      robotics        MSE
jmeint          3D modeling     miss rate
jpeg            compression     image diff
kmeans          ML              image diff
sobel           vision          image diff

[Diagram: SNNAP NPU microarchitecture with OpenMSP430 core]
SLIDE 36

Approximate Program Synthesis

Precise implementation + desired quality → approximate program

float dist_approx(int a[3], int b[3]) {
  int c1 = abs(b[0] - a[0]);
  int c2 = abs(b[1] - a[1]);
  int c3 = abs(a[2] - b[2]);
  int c4 = c1 | c2;
  int c5 = abs(c3 > c4 ? c3 : c4);
  return (float)c5;
}

3D Euclidean distance 1.6× faster, 14.9% error

Automatically discover the most efficient acceptable program [POPL’16]
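The synthesized function can be replayed in Python next to the precise 3-D Euclidean distance to see the flavor of the trade (the exact 14.9% figure depends on the input distribution used in the paper; this sketch only checks that the error stays in the same ballpark):

```python
# Precise 3-D Euclidean distance vs. the synthesized approximation
# (a Python transcription of the C function on this slide).
import random

def dist_precise(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dist_approx(a, b):
    c1 = abs(b[0] - a[0])
    c2 = abs(b[1] - a[1])
    c3 = abs(a[2] - b[2])
    c4 = c1 | c2              # bitwise OR: cheaper than add/sqrt, lossy
    return float(max(c3, c4))

rng = random.Random(0)
pts = [([rng.randrange(100) for _ in range(3)],
        [rng.randrange(100) for _ in range(3)]) for _ in range(1000)]
errs = [abs(dist_approx(a, b) - dist_precise(a, b)) / max(dist_precise(a, b), 1e-9)
        for a, b in pts]
mean_err = sum(errs) / len(errs)
assert 0.0 < mean_err < 0.5   # rough: same ballpark as the reported 14.9%
```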

SLIDE 37

ACCEPT, a framework for approximate programming

Annotated program + program inputs & quality metrics → ACCEPT static analysis → ILPC* → ACCEPT error injection & instrumentation → approximate binary → execution & quality assessment → quality autotuner → output configuration, quality results & bit savings

* Instruction-Level Precision Configuration

SLIDE 38

Deep Learning Compute Requirements Are Steadily Growing

Source: Eugenio Culurciello, An Analysis of Deep Neural Network Models for Practical Applications, arXiv:1605.07678

SLIDE 39

Neural Networks and Quantization

Specialized approximation: understand how parameters are generated/trained, and adjust them to compensate for approximations.

Application

Original Parameters

Approximate Application

Adjusted Parameters

Accuracy of a MLP (784-128-64-10) trained on MNIST

[Chart: validation accuracy (%) vs. weight precision from 8 down to 1 bits; precise baseline 96.78%. “Quantize during training” degrades far more gracefully than “quantize after training”.]

much more graceful quality degradation
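Uniform symmetric quantization of weights to k bits, the knob swept in the plot, can be sketched as follows (illustrative only; the real experiments quantize a trained MLP’s weight matrices, and assume a nonzero max weight):

```python
# Quantize a weight list to k bits: snap each weight to one of the
# 2^(k-1)-1 signed levels, spaced by a scale derived from the max weight.

def quantize(weights, bits):
    levels = 2 ** (bits - 1) - 1          # signed levels: -levels .. +levels
    scale = max(abs(w) for w in weights) / levels
    return [max(-levels, min(levels, round(w / scale))) * scale
            for w in weights]

w = [0.8, -0.4, 0.1, -1.0]
q = quantize(w, bits=4)                   # 4 bits: levels -7 .. +7
step = max(abs(x) for x in w) / (2 ** 3 - 1)
assert all(abs(a - b) <= step / 2 + 1e-12 for a, b in zip(w, q))
```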

SLIDE 40

Pareto-Optimization Challenge

Trade-off: accuracy vs. throughput

  • NN model & topology
  • Parameter compression
  • Operators
  • Compiler optimization
  • Architectural exploration

Search space grows exponentially!

SLIDE 41

Meta-Learning Preliminary Study

Proof-of-concept study: Explore impact of topology exploration, parameter compression and architectural exploration on accuracy vs. performance (MNIST + MLP)


[Chart: validation accuracy (0.87-0.99) vs. inference time (0.5-2 ms) for 8- down to 2-bit configurations, with the knee of the Pareto frontier marked]

NN Model, Topology Parameter Compression Architectural Exploration

SLIDE 42

SRAM Vdd Overscaling

Kim et al. MATIC [DATE’18 Best Paper]

[Diagram: SNNAP NPU (FIFOs, memory arbiter, systolic array of PEs with weight SRAMs and MACs, AFU with LUT and accumulators) integrated with an OpenMSP430 core, shared DMEM, GPIO & serial]

Tape-out of the SNNAP + MSP430 design. Use SRAM voltage scaling as a knob to reduce power.

Rely on memory-adaptive training to gracefully degrade application error

[Chart: normalized error vs. SRAM voltage]

SLIDE 43

Automatic Generation of HW-SW Stack for Deep Learning

SLIDE 44

Addressing the Programmability Challenge with TVM

Stack (top to bottom): High-Level Deep Learning Framework → NNVM Graph → TVM Compiler → RPC Layer → FPGA/SoC Drivers → Runtime & JIT Compiler → VTA FPGA Design

TVM DSL allows for separation of schedule and algorithm

tvmlang.org

SLIDE 45

Another important trend…

SLIDE 46

[Chart: storage (exabytes), 2005-2019; data generated grows toward ~50,000 EB, far outpacing installed capacity. Data: IDC]

Mostly videos and images

SLIDE 47

Approximate Storage [MICRO’13, ASPLOS’16, ASPLOS’17]

Multi-level cells can be fast and dense, or accurate, but not all three:

  • Precise: 2 bits/cell (levels 00, 01, 10, 11), widely spaced, high-probability levels
  • Approximate: 3 bits/cell (levels 000 through 111), levels packed closer, so occasional errors

2x gain in density and performance on JPEG-XR and H.264 data

SLIDE 48

Molecular data storage: a few atoms per bit. All major storage technologies are approaching their limits. Magnetic: SMR, HAMR, bit patterning; 3-4 scaling steps more? Flash: going vertical, but to what limit? Optical: limited by the wavelength of light.

SLIDE 49

10100011100 100011110011 11100010110 01010010111 101…

Manufactured DNA

Using Synthetic DNA for Data Storage

SLIDE 50

Photo: Tara Brown / UW

~10+ TB

SLIDE 51

1 Exabyte in 1 in³

Extremely Dense

100,000s of Years

Extremely Durable

Never Gets Obsolete

DNA Molecules for Digital Data

Making Copies Is Nearly Free

SLIDE 52

Write Path

  1. Encoding: 1 1 0 1 1 1 0 1 → A G C T A T C A G
  2. Synthesis

Read Path

  1. Sequencing: A G C T A T C A G
  2. Decoding: 1 1 0 1 1 1 0 1
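The simplest possible encoding for the write/read path is two bits per base (real systems add error-correcting codes and avoid homopolymer runs; this mapping is purely illustrative):

```python
# Map bit pairs to DNA bases and back: a toy version of the encode/decode
# steps; synthesis and sequencing happen in the molecular domain.

ENC = {"00": "A", "01": "C", "10": "G", "11": "T"}
DEC = {v: k for k, v in ENC.items()}

def encode(bits):
    assert len(bits) % 2 == 0
    return "".join(ENC[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand):
    return "".join(DEC[base] for base in strand)

payload = "1101110100"
strand = encode(payload)          # -> "TCTCA"
assert decode(strand) == payload  # lossless round trip
```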

SLIDE 53

Molecular domain Electronic domain

Hardware, Software, Wetware

SLIDE 54

20 Watts

SLIDE 55

Natural computing uses:
  • electric signals
  • molecular signals and processes (proteins, ions)
  • physical circuit plasticity

Approximation and moving parts are an inherent advantage of natural computing!

SLIDE 56

Computer architecture, coding theory, molecular biology, fluidics, automation

Computer architecture, programming languages, operating systems and machine learning


SLIDE 57


Luis Ceze University of Washington

Obrigado!

luisceze@cs.washington.edu

SLIDE 58

3D-Printed WiFi UW Reality Lab for AR/VR Research Battery-Free Cellphone Mobile Health DNA Data Storage MERGE Algorithm for Cancer Treatment

  • ~ 70 full-time faculty, and growing
  • ~ 25 postdoctoral researchers
  • > 1,000 undergrads (CS + CE)
  • ~ 100 combined BS/MS students
  • > 250 full-time Ph.D. students
  • ~ 180 Professional Masters students

  • ~ 100 full-time staff
  • 1 Home Exploring Robot Butler
SLIDE 59

A Center of Computing Research

SLIDE 60

A Center of Biomedical Research…

SLIDE 61

…and Global Health & Development

SLIDE 62

Award-winning Contributions

Best Papers at top research conferences:

Institution                Score
Microsoft Research          46.0
University of Washington    38.1
MIT                         34.3
CMU                         34.2
Stanford                    32.8
UC Berkeley                 25.2
University of Toronto       14.3
Cornell University          14.0
University of Illinois      13.7
IBM Research                12.8

Source: jeffhuang.com