High Performance and Energy Efficient Machine Learning Accelerators and Variable Precision Circuits for Sub-10nm Microprocessors - PowerPoint PPT Presentation


Slide 1

High Performance and Energy Efficient Machine Learning Accelerators and Variable Precision Circuits for Sub-10nm Microprocessors

(Invited Paper)

Ram Krishnamurthy Senior Principal Engineer & Director of High Performance and Low Voltage Circuits Research Group Circuit Research, Intel Labs Hillsboro, Oregon, U.S.A.

Slide 2

Era of Tera-scale Computing

Teraflops of performance operating on Terabytes of data

[Chart: performance (KIPS → MIPS → GIPS → TIPS) vs. dataset size (Kilobytes → Megabytes → Gigabytes → Terabytes), spanning single-core (text), multi-core (multimedia, 3D & video), and terascale (models; model-based apps: recognition, mining, synthesis). Example workloads: personal media creation and management; entertainment, learning, and virtual travel; health; financial analytics.]

Slide 3

Motivation: ML in IoT Platforms

Slide 4

Internet of Everything (IoE)

Need end-to-end energy efficiency & security

Slide 5

Tera-scale Microprocessors and SoCs

Deliver best user experience under constraints

[Die diagram: many-core tiles with per-core caches and last-level cache slices, a scalable on-die interconnect fabric, graphics/video special-purpose engines, integrated memory controllers, and off-die interconnect.]

  • Dynamic V/F control
  • Independent V/F control regions
  • Workload-based core activation & shutdown
  • Scenario-based power allocation
  • Maximize performance & efficiency
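The power-management knobs listed above can be illustrated with a toy allocator. This is a behavioral sketch under made-up assumptions: the operating points, the C·V²·f dynamic-power model, and the greedy policy are ours for illustration, not Intel's actual mechanism.

```python
# Toy model of per-region dynamic V/F control under a chip power budget.
# Operating points and the P ~ C * V^2 * f power model are illustrative
# assumptions, not measured data.
OPERATING_POINTS = [  # (voltage V, frequency GHz)
    (0.60, 0.8), (0.75, 1.6), (0.90, 2.4), (1.05, 3.2),
]
CAP = 1.0  # effective switched capacitance per region (arbitrary units)

def power(v, f):
    """Dynamic power ~ C * V^2 * f."""
    return CAP * v * v * f

def allocate(active_regions, budget):
    """Greedily raise each active region's V/F point while the chip power
    budget allows; inactive regions are power-gated (shut down) entirely."""
    levels = [0] * active_regions  # index into OPERATING_POINTS per region

    def total(ls):
        return sum(power(*OPERATING_POINTS[l]) for l in ls)

    improved = True
    while improved:
        improved = False
        for i in range(active_regions):
            if levels[i] + 1 < len(OPERATING_POINTS):
                trial = levels[:]
                trial[i] += 1
                if total(trial) <= budget:  # accept only if within budget
                    levels = trial
                    improved = True
    freqs = [OPERATING_POINTS[l][1] for l in levels]
    return freqs, total(levels)
```

With two active regions and a budget of 3.0 units, the allocator settles on asymmetric operating points, one region fast and one slower, which is the point of independent V/F regions.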

Slide 6

Motivation: IoT Technology Scaling Trends

Slide 7

  • More, better transistors
  • More cores
  • Continued benefits from Moore’s Law

[Chart: Moore’s Law scaling of transistor count (10^3 → 10^9) from 45nm (2007) to 14nm Trigate (2014).]

Source: Intel

Slide 8

Performance/Energy Scaling Trends

Source: Intel

Slide 9

Key Energy Challenge: Ongoing Scaling Requires Architecture Innovation

Single Core Plateau

Single-thread performance:
  • 1996 – 2004: increased 28x
  • 2004 – 2012: increased 4.6x
  • IPC gains now at ~3%/gen

[Chart: cores needed to achieve 90% of maximum performance (Amdahl’s Law) vs. cores available from process scaling (64–448 cores across the 45nm–7nm nodes), showing a widening multicore scalability gap, dark silicon, and the push toward system integration.]
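The widening gap follows directly from Amdahl's Law. A minimal sketch; the 97% and 99% parallel fractions below are illustrative choices, not figures from the slide:

```python
def amdahl_speedup(n_cores, p):
    """Amdahl's Law: speedup on n_cores for parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n_cores)

def cores_for_fraction(p, fraction=0.9, max_cores=1_000_000):
    """Smallest core count reaching `fraction` of the maximum
    achievable speedup, which is 1/(1-p) as n_cores -> infinity."""
    target = fraction / (1.0 - p)
    for n in range(1, max_cores + 1):  # linear scan, kept for clarity
        if amdahl_speedup(n, p) >= target:
            return n
    return None

# A 97% parallel workload needs ~291 cores to reach 90% of its maximum
# speedup; a 99% parallel one needs ~891, far more than one process
# generation adds, hence the scalability gap.
```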

Slide 10

Towards Energy Efficient Neuromorphic Computing

[Diagram: standard computing (CPU + memory, explicit programs: “if X then … else …”, binary data) vs. brain-inspired computing modeled on biological form, targeting “intelligent” applications.]

Slide 11

Energy Efficiency Challenge: Neuromorphic Accelerators for Cognitive Computing and Machine Learning

Specialized hardware (e.g., FPGAs) is good for efficiency, but problematic for software and system complexity.

Slide 12

Biological Inspiration

  • Brains exhibit energy-efficient intelligence at 20W
  • One-shot, unsupervised learning & inference; creativity
  • High parallelism: 100 billion neurons
  • Rich connectivity: 100 trillion synapses
  • Supercomputer implementation of a brain: ~100 server racks
  • ~1500x slower, and ~500 million times more power

iq.intel.com

Slide 13

Neuromorphic Landscape is Growing

[Landscape chart: players spanning theory, hardware/software/simulation, and applications/solutions (e.g., Neurithmic), and more.]

Slide 14

NTV Variable Precision FPU

  • H. Kaul, R. Krishnamurthy et al., ISSCC 2012

Slide 15

K-Nearest Neighbor ML Accelerator


  • On-die integrated special-purpose hardware accelerator for visual recognition vector matching

  • 128x128x8b vector search for the top “k nearest neighbors” (kNN)
  • Data-dependent accuracy refinement to increase energy efficiency
  • Reconfigurable for k and distance metric (Euclidean/Manhattan)
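As a behavioral reference for what the accelerator computes, here is a minimal full-precision software model with k and the distance metric reconfigurable, as on the slide. Function and variable names are ours, not from the paper:

```python
import heapq

def knn(query, refs, k=1, metric="manhattan"):
    """Return indices of the k reference vectors nearest to `query`.
    Squared Euclidean distance preserves the nearest-neighbor ranking,
    so the square root is omitted, as hardware implementations
    typically do."""
    def dist(r):
        if metric == "manhattan":
            return sum(abs(a - b) for a, b in zip(query, r))
        return sum((a - b) ** 2 for a, b in zip(query, r))  # squared Euclidean

    # Smallest-distance indices first, ties broken by index order.
    return heapq.nsmallest(k, range(len(refs)), key=lambda i: dist(refs[i]))
```

The hardware searches 128 reference vectors of 128 8b dimensions; this model accepts any sizes.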

[Block diagram: the 1024b query object vector (Q) is broadcast to 128 reference-vector slices (reference vector storage, partial distance compute, accumulator, local control), each producing a psum/valid pair; a minimum sort network and global control output {minaddr, minprecise, minvalid, minpsum}.]

Slide 16

[Diagram: kNN operation — a distance such as (q−r)² is computed between the 128×8b query vector (q) and each reference vector (r); vector distances are sorted with narrow bit-width comparisons, proceeding from MSB to LSB, to find the nearest neighbor.]

K-Nearest Neighbor ML Accelerator

  • k-Nearest-Neighbor (kNN): power/performance limiter for computer vision and classification workloads
  • Only closely matched vectors require higher precision → adapt precision per vector to guarantee accuracy
  • Majority of vectors eliminated with low precision → increased performance, reduced area and energy
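The precision-adaptation idea can be sketched in software: compute each distance from the MSBs first, keep lower/upper bounds that account for the truncated bits, and eliminate vectors whose lower bound already exceeds the best upper bound. A 1-NN Manhattan sketch with our own bounds bookkeeping; the hardware's exact control flow differs:

```python
def adaptive_nn(query, refs, bits=8, start_bits=3):
    """1-NN, Manhattan distance, with data-dependent precision.
    A distance computed from the top `used` bits is within
    d * (2^shift - 1) of the exact distance (d = dimensionality),
    giving sound bounds for elimination. Returns the same nearest
    neighbor as a full-precision search."""
    d = len(query)
    alive = set(range(len(refs)))
    for used in range(start_bits, bits + 1):
        shift = bits - used
        err = d * ((1 << shift) - 1)  # max total truncation error
        psum = {i: sum(abs((a >> shift) - (b >> shift)) << shift
                       for a, b in zip(query, refs[i]))
                for i in alive}
        # Best upper bound over surviving vectors; anything whose lower
        # bound exceeds it cannot be the nearest neighbor.
        best_upper = min(psum[i] + err for i in alive)
        alive = {i for i in alive if psum[i] - err <= best_upper}
        if len(alive) == 1:
            break
    exact = lambda i: sum(abs(a - b) for a, b in zip(query, refs[i]))
    return min(alive, key=exact)
```

Because the true nearest neighbor's lower bound can never exceed its own upper bound, it always survives elimination, so the low-precision passes only shrink the work, never the accuracy.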

Slide 17

[Chart: example kNN operation (Euclidean) — the search space (valid vectors, out of 128) shrinks over successive sort iterations until the kth NN is found.]

Iterative Search Space Reduction

  • Up to 5.2X higher throughput from early elimination
  • Up to 127X reduction for next nearest search space
Slide 18

kNN Accelerator Die Micrograph

[Die micrograph: 128 × 128-D kNN accelerator, 682µm × 488µm — I/O memory, I/O and clock, 64× 2-vector blocks (shared distance accumulators and controls per vector pair, 64 dimensions per side), distributed sort network.]

Process: 14nm Tri-gate CMOS
Nominal operation: 750mV, 338MHz, 25°C
Number of transistors: 12.2M
Accelerator area: 0.333mm²

  • H. Kaul, R. Krishnamurthy et al., ISSCC 2016
Slide 19

[Chart: query latency (cycles) vs. k nearest neighbors (k = 2–10).]

Performance Measurements

  • 21.5M queries/s and 16 cycles/query (Manhattan, k=1)
  • Average latency increase for each successive neighbor: 2 cycles (Manhattan) and 4 cycles (Euclidean)
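These figures are mutually consistent: at the 338MHz nominal clock, 16 cycles/query gives roughly 21M queries/s. A quick check, using Python as a calculator; the note on the small gap is our inference, not from the slide:

```python
freq_hz = 338e6         # nominal clock from the die summary
cycles_per_query = 16   # Manhattan, k=1
throughput_qps = freq_hz / cycles_per_query
# ~21.1M queries/s, in line with the reported 21.5M queries/s
# (the small gap suggests 16 cycles is a rounded figure)
```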

Slide 20

14nm CMOS, 338MHz, 750mV, 25°C

[Chart: total power (mW) and total energy/query vector (nJ) vs. k nearest neighbors (k = 2–10); callouts: 3.37nJ/query, 9.7TOPS/W, 73mW.]

Power Measurements

  • 73mW total power, 3.37nJ/query (Manhattan, k=1)
  • Average energy increase for each successive neighbor: 43pJ (Manhattan) and 87pJ (Euclidean)
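These numbers cross-check as well. The operation count per query below is our assumption (128 vectors × 128 dimensions × 2 ops, roughly one subtract/absolute and one accumulate per dimension), not stated on the slide:

```python
power_w = 73e-3                # total power (Manhattan, k=1)
throughput_qps = 21.5e6        # queries/s from the performance slide
energy_per_query_j = power_w / throughput_qps
# ~3.4e-9 J, matching the reported 3.37 nJ/query

ops_per_query = 128 * 128 * 2  # assumed: subtract + accumulate per dimension
tops_per_w = ops_per_query / energy_per_query_j / 1e12
# ~9.7 TOPS/W, matching the reported efficiency
```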

Slide 21

14nm CMOS, 25°C

[Chart: throughput (million query vectors/s) and total power (mW) vs. supply voltage (350–850mV), Manhattan and Euclidean, k=1.]

Supply Voltage Scaling Measurements

  • Robust NTV circuits enable 360mV-850mV operation
  • 26.4M queries/s, 114mW at 850mV (Manhattan, k=1)
  • 1.1M queries/s, 1.44mW at 360mV (Manhattan, k=1)
Slide 22

[Chart: energy/query (nJ) vs. supply voltage (350–850mV).]

Energy Scaling Measurements

  • Peak efficiency of 1.23nJ/query or 26.5TOPS/W at 390mV (near-threshold) → 2.73X improvement over nominal
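The reported improvement follows from the two energy points, and equally from the TOPS/W figures:

```python
nominal_nj = 3.37  # nJ/query at 750mV nominal
ntv_nj = 1.23      # nJ/query at 390mV near-threshold
improvement = nominal_nj / ntv_nj  # ~2.74, i.e. the reported ~2.73X
tops_ratio = 26.5 / 9.7            # same ratio from the TOPS/W figures
```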

Slide 23

“Extreme” energy efficiency

10-year goal: ~300X improvement in energy efficiency, equal to 20 pJ/FLOP at the system level (2W for 100 GigaFLOPS; 20MW for an ExaFLOPS system)

Slide 24

intelligence Inside

Slide 25


  • Iteratively search for kNN within 128x128-D vectors
  • Distant vectors eliminated in early iterations
  • Reconfigurable for Manhattan and Euclidean distance

kNN Accelerator Organization

Slide 26


  • Narrow single-cycle datapath for distance compute
  • Accumulate computed refinement to distance

Organization: Distance Compute

Slide 27


kNN Operation: Adaptive Precision

  • Data-dependent precision for each vector → reduces required compute and sort operations
  • Same nearest-neighbor result as a full-precision search
Slide 28

[Chart: average search space reduction (×) vs. k nearest neighbors (k = 2–10), Euclidean and Manhattan.]

Average Search Space Reduction

  • 10X-18X average reduction of starting search space for next nearest neighbor