

SLIDE 1


Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Computational Complexity and New Computing Approaches

Wildly Heterogeneous Post-CMOS Technologies Meet Software

Erik P. DeBenedictis, Center for Computing Research, Sandia National Laboratories

Approved for unlimited release, SAND2017-0924 C


SLIDE 2

Overview

Logic devices fall into categories by potential upside:

  • A large class of devices is limited by thermodynamics.
    – CMOS is in this large class and has a big head start.
    – A common limit precludes any of them from being much better than the others.
  • However, the differences are worth exploiting.
  • How do we compare within the large class? Solution: complexity theory based on a kT measure.
  • This discounts CMOS’s maturity advantage by assessing physical limits of energy efficiency in units of kT.
  • Use algorithmic complexity to assess devices’ ability to combine into useful functions.
  • Analog vs. digital kT comparisons need to work.


SLIDE 3

Scope of Talk is the Red Class (the thermodynamically limited rows of the table below)

| Name of approach | Performance limit or other capability | Investment to date |
|---|---|---|
| Neural networks (irrespective of implementation) | Learning and maybe intelligence¹ | Billions |
| Quantum computing (superconducting electronics) | Quantum speedup | Billions |
| Neuromorphic computing, i.e. implementations of neural networks | Thermodynamic (kT)¹ | Billions |
| Novel devices: spintronics, carbon nanotubes, Josephson junctions, new memories, etc. | Thermodynamic (kT)² | Millions (each) |
| Analog computing | Thermodynamic (kT)³ | Millions |
| “3D + architecture,” i.e. continuation of Moore’s law | Thermodynamic (kT)⁴ | Trillion |
| Reversible computing | Arbitrarily low energy/op | Millions |

1. DeBenedictis, Erik P. "Rebooting Computers as Learning Machines." Computer 49.6 (2016): 84-87.
2. DeBenedictis, Erik P. "The Boolean Logic Tax." Computer 49.4 (2016): 79-82.
3. DeBenedictis, Erik P. "Computational Complexity and New Computing Approaches." Computer 49.12 (2016): 76-79.
4. DeBenedictis, Erik P. "It's Time to Redefine Moore's Law Again." Computer 50.2 (2017): 40-43 (in press).


SLIDE 4

Overview of Example

Memristor-based neural networks as an example

  • Analog memristor-based neural networks are claimed to be more energy-efficient than a digital implementation.
  • Difficulties in comparison:
    – Scale: measured memristor circuits are small, but a GPU cluster can execute billions of synapses.
    – Precision: memristors typically have a dozen levels, but GPUs use floating point.

Analyzing via complexity theory based on a kT measure

  • Let’s compare limits:
    – Digital kT limits via Landauer, etc.
    – Analog kT limits from circuit theory.
  • Result (below; will derive).
  • Interpretation (will discuss): there is a parameter space of scale and precision where each is best.

$E_{digital} \approx 24\,\ln(1/p_{error})\,\log_2^2(L)\,N\,kT$

$E_{analog} \approx \tfrac{1}{24}\,\ln(1/p_{error})\,L^2 N^2\,kT$
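As a quick sanity check on these two expressions, here is a minimal numeric sketch (my own illustration, not from the talk; the prefactors and functional forms are the slide’s, while the sample values of L, N, and p_error are assumptions):

```python
import math

def e_digital(L, N, p_error):
    """Digital bound from the slide: ~24 ln(1/p_error) log2(L)^2 N, in kT."""
    return 24 * math.log(1 / p_error) * math.log2(L) ** 2 * N

def e_analog(L, N, p_error):
    """Analog bound from the slide: ~(1/24) ln(1/p_error) L^2 N^2, in kT."""
    return (1 / 24) * math.log(1 / p_error) * L ** 2 * N ** 2

# A dozen distinguishable levels (memristor-like), dot products of growing length
for N in (4, 16, 64, 256):
    d, a = e_digital(12, N, 1e-3), e_analog(12, N, 1e-3)
    print(f"N={N:4d}: digital ~{d:9.0f} kT, analog ~{a:9.0f} kT")
```

For these assumed numbers the analog bound wins at small N and loses at large N, previewing the parameter-space picture on SLIDE 9.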


SLIDE 5

Novelty of Next Few Slides

  • To compare digital and analog, we need $p_{error}$, the probability that the answer will be wrong. Reliability goes up with energy, so we need a common reference point.
  • Analog circuits are limited by thermal noise of magnitude kT, but the theory is not organized in the same way as digital minimum energy.
  • The terminology has to line up.


SLIDE 6

Digital Minimum Energy

Digital circuit

  • Vectors v and w are inputs.

Minimum energy

  • $p_{error}$ per input is $e^{-E_{signal}/kT}$.
  • Leading to gate energy $E_{gate} \approx 2\,\ln(1/p_{error})\,kT$, assuming 2 inputs.
  • L distinguishable levels require $\log_2 L$-bit binary numbers.
  • A multiplier array for $\log_2 L$-bit numbers is about $6\log_2^2(L)$ gates, and the dot product needs N multiplies; assume 100% overhead (2×).

$E_{digital} \approx 24\,\ln(1/p_{error})\,\log_2^2(L)\,N\,kT$
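Collecting the factors in one line (a worked restatement of the bullets above; grouping the terms this way is my reading of the slide):

$$E_{digital} \approx \underbrace{2}_{\text{overhead}} \times \underbrace{6\,\log_2^2(L)\,N}_{\text{gate count}} \times \underbrace{2\,\ln(1/p_{error})\,kT}_{\text{energy per gate}} = 24\,\ln(1/p_{error})\,\log_2^2(L)\,N\,kT$$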


SLIDE 7

Analog Minimum Energy I

Analog circuit

  • Inputs v and w = 1/g.
  • $L = 2V/V_{pn}$, where V is the supply voltage and $V_{pn}$ is the peak noise at the amplifier.

Circuit analysis

  • $P_n = 4kTf = V_n^2\,\tfrac{1}{2}g_{max}N$, where $P_n$ ($V_n$) is the noise power (voltage) at the amplifier, f is the amplifier bandwidth, and conductivities range over $0 \ldots g_{max}$.
  • $V_{pn} = V_n\,A_v\,\sqrt{\ln(1/p_{error})}$, where $V_{pn}$ is the peak noise.
  • $P_{dot} = \tfrac{1}{6}\,V^2 g_{max} N$, where $P_{dot}$ is the power of the dot product.
  • $E_{dot} = P_{dot}/(2f)$, where $E_{dot}$ is the energy at the Nyquist frequency.


SLIDE 8

Analog Minimum Energy II

So now what happens?

  • Landauer’s contribution was to establish implementation-independent minimum energies for computation.
  • The previous slide was just a bunch of circuit equations:
    – Two equations with $g_{max}$
    – Two equations with V
    – Two equations with f
  • If Landauer was right, the circuit values should cancel.
  • Hmm. Let’s try…

$E_{analog} \approx \tfrac{1}{24}\,\ln(1/p_{error})\,L^2 N^2\,kT$
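The cancellation can be checked symbolically. The sketch below substitutes the SLIDE 7 relations into $E_{dot}$; note that making the algebra land exactly on the slide’s result requires a summing-amplifier gain of $A_v = N/2$, which is my assumption, not something stated on the slides:

```python
import sympy as sp

kT, f, gmax, N, L, perr, Av = sp.symbols('kT f g_max N L p_err A_v', positive=True)

# Circuit relations from SLIDE 7
Vn   = sp.sqrt(4 * kT * f / (sp.Rational(1, 2) * gmax * N))  # P_n = 4kTf = Vn^2 (1/2) g_max N
Vpn  = Vn * Av * sp.sqrt(sp.log(1 / perr))                   # peak noise at the amplifier
V    = L * Vpn / 2                                           # from L = 2V / V_pn
Pdot = sp.Rational(1, 6) * V**2 * gmax * N                   # dot-product power
Edot = Pdot / (2 * f)                                        # energy at the Nyquist rate

print(sp.simplify(Edot))
# proportional to A_v^2 L^2 kT ln(1/p_err) / 6: g_max, f, and N have cancelled,
# as Landauer's implementation-independence argument predicts
print(sp.simplify(Edot.subs(Av, N / 2)))
# with the assumed gain A_v = N/2: L^2 N^2 kT ln(1/p_err) / 24
```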


SLIDE 9

Comparison of Minimum Energies

Each “wins” in a region of the parameter space. How can this be right? The human brain is misplaced.

  • Well, actually, the human brain is digital.
  • Tell story: neuroscientist Brad looked at the result and said “oh yeah, biology uses level-based signaling in C. elegans and retinas…” but only at small scale. So maybe god/evolution figured this out already.

$E_{digital} \approx 24\,\ln(1/p_{error})\,\log_2^2(L)\,N\,kT$

$E_{analog} \approx \tfrac{1}{24}\,\ln(1/p_{error})\,L^2 N^2\,kT$
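Setting the two bounds equal shows where the regions meet; $\ln(1/p_{error})$ cancels, leaving a crossover vector length $N = 576\,\log_2^2(L)/L^2$ (my own arithmetic on the slide’s formulas):

```python
import math

def crossover_N(L):
    """Vector length at which the analog and digital bounds cross.
    Below this N the analog bound is lower; above it, digital wins."""
    return 576 * math.log2(L) ** 2 / L ** 2

for L in (2, 4, 8, 12, 16, 32):
    print(f"L={L:3d} levels: analog bound wins for N < {crossover_N(L):6.1f}")
```

Low precision (small L) favors analog out to larger scales, consistent with the C. elegans/retina remark above.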


SLIDE 10


What’s Different?

  • Variable energy per multiply, at equal precision.
  • Divide by N for the energy per arithmetic operation (equations below).
  • The energy consumed by an analog multiply depends on how many times the result is added up.
  • …or maybe, multiplies are free, but adds are not?
  • Why? Circuit equations rule, but intuitively, signals flow backwards through the memristor array (show the audience).
  • Consequence: algorithms do not readily transport from analog to digital and vice versa.

$E_{digital}/N \approx 24\,\ln(1/p_{error})\,\log_2^2(L)\,kT$

$E_{analog}/N \approx \tfrac{1}{24}\,\ln(1/p_{error})\,L^2\,N\,kT$ ← look here: the analog energy per operation still grows with N

SLIDE 11

Second Example: Ultra Low-energy Synapse


  • The kT-limits approach can be applied to a quasi-analog neural synapse.
  • It achieves much less than kT energy dissipation per training cycle.
  • Why?
    – Most neural network learning is merely verifying that the system has learned what it needs to know.
    – Only state changes need to dissipate energy.
  • Ref: DeBenedictis, Erik P., et al. "A Path Toward Ultra-Low-Energy Computing." Rebooting Computing (ICRC), IEEE International Conference on. IEEE, 2016.

SLIDE 12

Landauer’s Method Extracted From his Paper

| prob | inputs p q r | output state | Si contribution (k's) | Sf contribution (k's) |
|---|---|---|---|---|
| 0.125 | 1 1 1 | α | 0.25993 | 0.25993 |
| 0.125 | 1 1 0 | β | 0.25993 | 0.25993 |
| 0.125 | 1 0 1 | γ | 0.25993 | 0.367811 |
| 0.125 | 1 0 0 | δ | 0.25993 | 0.367811 |
| 0.125 | 0 1 1 | γ | 0.25993 | (merged above) |
| 0.125 | 0 1 0 | δ | 0.25993 | (merged above) |
| 0.125 | 0 0 1 | γ | 0.25993 | (merged above) |
| 0.125 | 0 0 0 | δ | 0.25993 | (merged above) |

Totals: Si = 2.079442 k's; Sf = 1.255482 k's; Si − Sf = 0.823959 k's.

[Figure: system mapping inputs (p, q, r) to outputs (p1, q1, r1), from the source.]

[Landauer 61] Landauer, Rolf. "Irreversibility and Heat Generation in the Computing Process." IBM Journal of Research and Development 5.3 (1961): 183-191.

…typically of the order of kT for each irreversible function

SLIDE 13

Backup: Details

  • Each input combination gets a row.
  • Each input combination k has probability pk, the pk's summing to 1.
  • Si (i for input) is the sum of all −pk ln pk's, in units of k.
  • Each unique output combination is analyzed.
  • Rows merge if the machine produces the same output.
  • Each output combination k has probability pk, the pk's summing to 1.
  • Sf (f for final) is the sum of all −pk ln pk's, in units of k.
  • Minimum energy is (Si − Sf) × T.
  • Notes:
    – Input states that don't merge do not raise minimum energy.
    – Inputs that merge raise minimum energy based on their probability.
    – Assumption: all input combinations equally probable.
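The recipe mechanizes directly. A short script (my own illustration; the entropy bookkeeping follows the bullets above, and the example output states reproduce the SLIDE 12 table):

```python
import math
from collections import defaultdict

def landauer_min_energy(outputs, probs=None):
    """(Si - Sf) in units of kT: entropies of the input distribution and of
    the distribution after rows with equal outputs merge."""
    n = len(outputs)
    probs = probs if probs is not None else [1.0 / n] * n  # equiprobable default
    s_i = -sum(p * math.log(p) for p in probs)
    merged = defaultdict(float)
    for out, p in zip(outputs, probs):
        merged[out] += p              # rows merge if the machine's output is equal
    s_f = -sum(p * math.log(p) for p in merged.values())
    return s_i - s_f

# SLIDE 12: eight equiprobable 3-bit inputs collapse to states
# alpha, beta (probability 1/8 each) and gamma, delta (3/8 each)
outputs = ['a', 'b', 'g', 'd', 'g', 'd', 'g', 'd']
print(f"Si - Sf = {landauer_min_energy(outputs):.6f} k's")  # ~0.823959
```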


SLIDE 14

Example: a Learning Machine

[Figure: an input stream of 0s and 1s that continues indefinitely drives an array of old-style magnetic cores; signals create currents, and a core flips.]

This “learning machine” example exceeds the energy-efficiency limits of Boolean logic. The learning machine monitors the environment for knowledge, yet usually just verifies that it has learned what it needs to know. Say “causes” (lion, apple, and night) and “effects” (danger, food, and sleep) have value 1.

Example input: {lion, danger} {apple, food} {night, sleep} {lion, danger} {apple, food} {night, sleep} {lion, danger} {apple, food} {night, sleep} {lion, danger, food} {apple, food} {night, sleep} {lion, danger} {lion, danger}

Functional example: the machine continuously monitors the environment for {1, 1} or {−1, −1} pairs and remembers them in the state of a magnetic core. Theoretically, there is no need for energy consumption unless state changes.

[Figure labels: lion, apple, night (causes); danger, food, sleep (effects).]
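A toy simulation of the learning machine (entirely my own sketch; the event stream copies the slide’s example input, and “energy” is simply a count of actual core flips):

```python
# One magnetic core per (cause, effect) pair; an observed {1, 1} pair sets the
# core, and only a genuine state change dissipates energy.
causes, effects = ("lion", "apple", "night"), ("danger", "food", "sleep")
core = {(c, e): 0 for c in causes for e in effects}

events = [{"lion", "danger"}, {"apple", "food"}, {"night", "sleep"}] * 3 + [
    {"lion", "danger", "food"}, {"apple", "food"}, {"night", "sleep"},
    {"lion", "danger"}, {"lion", "danger"}]

flips = checks = 0
for event in events:
    for c in causes:
        for e in effects:
            if c in event and e in event:   # a {1, 1} coincidence observed
                checks += 1
                if core[(c, e)] == 0:       # state change: must dissipate
                    core[(c, e)] = 1
                    flips += 1
                # else: merely verifying what was already learned; free in principle

print(f"{flips} dissipative flips in {checks} coincidences")  # most checks are free
```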


SLIDE 15

Analysis of One Synapse

[Table: Landauer-style analysis of one synapse. Each row lists the probability of an input combination (left wire, right wire, field direction), the resulting outputs, the Si contribution, the merged state, and the Sf contribution. Sixteen non-learning combinations have probability 0.062438 each (Si contribution 0.173176 k's apiece); two learning events have probability 0.0005 each (Si contribution 0.0038 k's apiece) and merge into existing states (Sf contribution 0.174061 k's for those states). Totals: Si = 2.778417 k's; Sf = 2.772585 k's. With a learning-event probability of 0.001, the minimum energy is Si − Sf = 0.005831 k's, far below 1 kT.]

Boolean logic equivalent system: [Figure: the same indefinitely continuing input stream monitored by an old-style magnetic core.]
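The table’s totals can be re-derived in a few lines (my own check; the 16-states-plus-2-rare-learning-events layout is read off the table summary above):

```python
import math

entropy = lambda ps: -sum(p * math.log(p) for p in ps)  # in units of k

# Inputs: 16 verify-only combinations plus 2 rare learning events (p = 0.0005 each)
s_i = entropy([0.062438] * 16 + [0.0005, 0.0005])
# Outputs: each learning event merges into one of the 16 existing states
s_f = entropy([0.062438 + 0.0005] * 2 + [0.062438] * 14)

print(f"Si = {s_i:.6f} k's, Sf = {s_f:.6f} k's")
print(f"minimum energy = {s_i - s_f:.6f} kT")   # ~0.0058 kT, far below 1 kT
```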

SLIDE 16

Why is the “Limit” so Low? Probabilities, Aggregation, and PIM Principles

  • Synapses usually just verify that they have learned what they need to know, and actually change state with low probability. Only state changes need to dissipate.
  • The Landauer minimum energy stays the same or rises when a function is broken up into pieces; it cannot decrease.
    – If splitting into pieces produces intermediate variables that have to be erased, minimum energy will increase.
    – If the pieces digitally restore signals, they can't be aggregated.
  • Logic-memory integration helps. If you have to ship data a long distance, you probably can't use a single Landauer table.

[Figure: like Landauer's "machine," but l and r are trits and s, s1 are state; trit inputs (l, r, s) map to (l1, r1, s1).]


SLIDE 17

Can We Find a Device or Circuit that Might be Able to Reach the Limit Described?

  • Requirements:
    – Row, column addressable (i.e. the array).
    – The addressed cell can be set to 1 or −1; all other cells are unchanged.
    – Zero dissipation if a cell is unaddressed or its value is already correct.
    – Minimum energy (TΔS) if a cell changes state.
  • Literature:
    – P. Zulkowski and M. DeWeese, "Optimal Finite-Time Erasure of a Classical Bit," Physical Review E 89.5 (2014): 052140.
    – Uses a protocol for raising/lowering barriers and tilt.
    – Dissipation is TΔS + O(1/tf), approaching Landauer's minimum in the limit of large protocol time tf (we can have a lot of discussion on this if you like).
  • Is there a circuit that does this?
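A toy illustration of that asymptotic (sketch only; kT ln 2 is the Landauer minimum for a bit erasure, while the O(1/tf) coefficient below is an arbitrary assumed constant, not a number from the paper):

```python
import math

LANDAUER = math.log(2)   # minimum erasure cost for one bit, in kT units
C = 10.0                 # assumed coefficient of the O(1/tf) excess (illustrative)

for tf in (1, 10, 100, 1000):            # protocol duration, arbitrary units
    print(f"tf = {tf:5d}: dissipation ~ {LANDAUER + C / tf:.4f} kT")
# The excess falls off as 1/tf, so Landauer's minimum is reached only in the
# infinite-time limit, per Zulkowski & DeWeese.
```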


SLIDE 18

Semenov’s nSQUID circuit

  • A. Circuit
  • B. Measurements
  • C. Micrograph
  • D. Behavior: 2.5 ln 2 kT for 16 devices (~1/3 kT/device)
  • Ref: V. K. Semenov, G. V. Danilov, and D. V. Averin, "Negative-Inductance SQUID as the Basic Element of Reversible Josephson-Junction Circuits," IEEE Transactions on Applied Superconductivity 13.2 (2003): 938-943.


SLIDE 19


Addition of Addressing

[Figure: array addressing with currents Icol0–Icol2, Irow0–Irow2, and Idata; energy-vs-I(data) curves for selected, half-selected, and unselected cells.]

  • The author proposes addressing, which was not present in Semenov's work.
  • Excel spreadsheet of the potential wells:
    – Top: the addressed cell.
    – Lower: un-addressed and half-addressed cells.
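In code, the addressing requirement from SLIDE 17 looks like a classic coincident-current discipline (my own sketch; the drive levels and threshold are arbitrary illustrative numbers):

```python
# Coincident-current addressing: a cell is driven by its row and column lines.
# Only the fully selected cell exceeds the switching threshold; half-selected
# and unselected cells hold their state and, ideally, dissipate nothing.
ROW_DRIVE = COL_DRIVE = 0.5   # arbitrary units; each alone is a half select
THRESHOLD = 0.75              # switching threshold between half and full select

def drive(row_selected: bool, col_selected: bool) -> float:
    return ROW_DRIVE * row_selected + COL_DRIVE * col_selected

for r, c, label in [(True, True, "selected"),
                    (True, False, "half select"),
                    (False, False, "unselected")]:
    d = drive(r, c)
    print(f"{label:12s}: drive {d:.2f} -> {'switches' if d > THRESHOLD else 'holds'}")
```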

SLIDE 20

Conclusions I

What did we do?

  • Analog and digital devices/gates can be compared by the minimum number of kT to compute a result:
    – given a probability p_error that the computation gives the wrong answer;
    – the circuit has to be set up properly.
  • It took many tries to get the terminology to line up.
  • Two examples were given: memristor and learning.

Memristor

  • The memristor is a straight classical device.
  • The circuit is analyzed as an algorithm, using minimum energy in units of kT as the measure.
  • It could beat digital at low precision and low complexity.


SLIDE 21

Conclusions II

Superconducting circuit

  • An absolutely bizarre technology, albeit physically demonstrated.
  • Classical, and beats Landauer's limit without being reversible.
  • However, analysis at the algorithm level is essential to the low-energy result:
    – there is no way to enter the probabilities otherwise.

What next?

  • Circuits are algorithms that create complex functions out of multiple devices.
  • The circuit-algorithm minimum energy is measurable as a function of problem parameters × kT.
  • If we assume digital will approach physical limits, the approach tells us when analog can compete.
