High Performance and Energy Efficient Machine Learning Accelerators and Variable Precision Circuits for Sub-10nm Microprocessors - PowerPoint PPT Presentation


Slide 1

High Performance and Energy Efficient Machine Learning Accelerators and Variable Precision Circuits for Sub-10nm Microprocessors

(Invited Paper)

Ram Krishnamurthy Senior Principal Engineer & Director of High Performance and Low Voltage Circuits Research Group Circuit Research, Intel Labs Hillsboro, Oregon, U.S.A.

Slide 2

Era of Tera-scale Computing

Teraflops of performance operating on Terabytes of data

[Chart: performance (KIPS → MIPS → GIPS → TIPS) vs. dataset size (Kilobytes → Megabytes → Gigabytes → Terabytes), spanning single-core (text), multi-core (multimedia, 3D & video), and terascale (models; model-based apps: recognition, mining, synthesis). Example workloads: personal media creation and management; entertainment, learning, and virtual travel; health; financial analytics.]

Slide 3

Motivation: ML in IoT Platforms

Slide 4

Internet of Everything (IoE)

Need end-to-end energy efficiency & security

Slide 5

Tera-scale Microprocessors and SoCs

Deliver best user experience under constraints

[Die diagram: many-core tiles with per-core caches and last-level cache slices, a scalable on-die interconnect fabric, graphics/video special-purpose engines, integrated memory controllers, and off-die interconnect.]

  • Dynamic V/F control
  • Independent V/F control regions
  • Workload-based core activation & shutdown
  • Scenario-based power allocation
  • Maximize performance & efficiency
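The power-management knobs listed above can be illustrated with a toy allocator. This is a behavioral sketch under made-up assumptions: the operating points, the C·V²·f dynamic-power model, and the greedy policy are ours for illustration, not Intel's actual mechanism.

```python
# Toy model of per-region dynamic V/F control under a chip power budget.
# Operating points and the P ~ C * V^2 * f power model are illustrative
# assumptions, not measured data.
OPERATING_POINTS = [  # (voltage V, frequency GHz)
    (0.60, 0.8), (0.75, 1.6), (0.90, 2.4), (1.05, 3.2),
]
CAP = 1.0  # effective switched capacitance per region (arbitrary units)

def power(v, f):
    """Dynamic power ~ C * V^2 * f."""
    return CAP * v * v * f

def allocate(active_regions, budget):
    """Greedily raise each active region's V/F point while the chip power
    budget allows; inactive regions are power-gated (shut down) entirely."""
    levels = [0] * active_regions  # index into OPERATING_POINTS per region

    def total(ls):
        return sum(power(*OPERATING_POINTS[l]) for l in ls)

    improved = True
    while improved:
        improved = False
        for i in range(active_regions):
            if levels[i] + 1 < len(OPERATING_POINTS):
                trial = levels[:]
                trial[i] += 1
                if total(trial) <= budget:  # accept only if within budget
                    levels = trial
                    improved = True
    freqs = [OPERATING_POINTS[l][1] for l in levels]
    return freqs, total(levels)
```

With two active regions and a budget of 3.0 units, the allocator settles on asymmetric operating points, one region fast and one slower, which is the point of independent V/F regions.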

Slide 6

Motivation: IoT Technology Scaling Trends

Slide 7

  • More, better transistors
  • More cores
  • Continued benefits from Moore’s Law

[Chart: Moore’s Law scaling of transistor count (10^3 → 10^9) from 45nm (2007) to 14nm Trigate (2014).]

Source: Intel

Slide 8

Performance/Energy Scaling Trends

Source: Intel

Slide 9

Key Energy Challenge: Ongoing Scaling Requires Architecture Innovation

Single Core Plateau

Single-thread performance:
  • 1996 – 2004: increased 28x
  • 2004 – 2012: increased 4.6x
  • IPC gains now at ~3%/gen

[Chart: cores needed to achieve 90% of maximum performance (Amdahl’s Law) vs. cores available from process scaling (64–448 cores across the 45nm–7nm nodes), showing a widening multicore scalability gap, dark silicon, and the push toward system integration.]
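The widening gap follows directly from Amdahl's Law. A minimal sketch; the 97% and 99% parallel fractions below are illustrative choices, not figures from the slide:

```python
def amdahl_speedup(n_cores, p):
    """Amdahl's Law: speedup on n_cores for parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n_cores)

def cores_for_fraction(p, fraction=0.9, max_cores=1_000_000):
    """Smallest core count reaching `fraction` of the maximum
    achievable speedup, which is 1/(1-p) as n_cores -> infinity."""
    target = fraction / (1.0 - p)
    for n in range(1, max_cores + 1):  # linear scan, kept for clarity
        if amdahl_speedup(n, p) >= target:
            return n
    return None

# A 97% parallel workload needs ~291 cores to reach 90% of its maximum
# speedup; a 99% parallel one needs ~891, far more than one process
# generation adds, hence the scalability gap.
```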

Slide 10

Towards Energy Efficient Neuromorphic Computing

[Diagram: standard computing (CPU + memory, explicit programs: “if X then … else …”, binary data) vs. brain-inspired computing modeled on biological form, targeting “intelligent” applications.]

Slide 11

Energy Efficiency Challenge: Neuromorphic Accelerators for Cognitive Computing and Machine Learning

Specialized hardware (e.g., FPGAs) is good for efficiency, but problematic for software and system complexity.

Slide 12

Biological Inspiration

  • Brains exhibit energy-efficient intelligence at 20W
  • One-shot, unsupervised learning & inference; creativity
  • High parallelism: 100 billion neurons
  • Rich connectivity: 100 trillion synapses
  • Supercomputer implementation of a brain: ~100 server racks
  • ~1500x slower, and ~500 million times more power

iq.intel.com

Slide 13

Neuromorphic Landscape is Growing

[Landscape chart: players spanning theory, hardware/software/simulation, and applications/solutions (e.g., Neurithmic), and more.]

Slide 14

NTV Variable Precision FPU

  • H. Kaul, R. Krishnamurthy et al., ISSCC 2012

Slide 15

K-Nearest Neighbor ML Accelerator


  • On-die integrated special-purpose hardware accelerator for visual recognition vector matching

  • 128x128x8b vector search for the top “k nearest neighbors” (kNN)
  • Data-dependent accuracy refinement to increase energy efficiency
  • Reconfigurable for k and distance metric (Euclidean/Manhattan)
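As a behavioral reference for what the accelerator computes, here is a minimal full-precision software model with k and the distance metric reconfigurable, as on the slide. Function and variable names are ours, not from the paper:

```python
import heapq

def knn(query, refs, k=1, metric="manhattan"):
    """Return indices of the k reference vectors nearest to `query`.
    Squared Euclidean distance preserves the nearest-neighbor ranking,
    so the square root is omitted, as hardware implementations
    typically do."""
    def dist(r):
        if metric == "manhattan":
            return sum(abs(a - b) for a, b in zip(query, r))
        return sum((a - b) ** 2 for a, b in zip(query, r))  # squared Euclidean

    # Smallest-distance indices first, ties broken by index order.
    return heapq.nsmallest(k, range(len(refs)), key=lambda i: dist(refs[i]))
```

The hardware searches 128 reference vectors of 128 8b dimensions; this model accepts any sizes.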

[Block diagram: the 1024b query object vector (Q) is broadcast to 128 reference-vector slices (reference vector storage, partial distance compute, accumulator, local control), each producing a psum/valid pair; a minimum sort network and global control output {minaddr, minprecise, minvalid, minpsum}.]

Slide 16

[Diagram: kNN operation — a distance such as (q−r)² is computed between the 128×8b query vector (q) and each reference vector (r); vector distances are sorted with narrow bit-width comparisons, proceeding from MSB to LSB, to find the nearest neighbor.]

K-Nearest Neighbor ML Accelerator

  • k-Nearest-Neighbor (kNN): power/performance limiter for computer vision and classification workloads
  • Only closely matched vectors require higher precision → adapt precision per vector to guarantee accuracy
  • Majority of vectors eliminated with low precision → increased performance, reduced area and energy
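The precision-adaptation idea can be sketched in software: compute each distance from the MSBs first, keep lower/upper bounds that account for the truncated bits, and eliminate vectors whose lower bound already exceeds the best upper bound. A 1-NN Manhattan sketch with our own bounds bookkeeping; the hardware's exact control flow differs:

```python
def adaptive_nn(query, refs, bits=8, start_bits=3):
    """1-NN, Manhattan distance, with data-dependent precision.
    A distance computed from the top `used` bits is within
    d * (2^shift - 1) of the exact distance (d = dimensionality),
    giving sound bounds for elimination. Returns the same nearest
    neighbor as a full-precision search."""
    d = len(query)
    alive = set(range(len(refs)))
    for used in range(start_bits, bits + 1):
        shift = bits - used
        err = d * ((1 << shift) - 1)  # max total truncation error
        psum = {i: sum(abs((a >> shift) - (b >> shift)) << shift
                       for a, b in zip(query, refs[i]))
                for i in alive}
        # Best upper bound over surviving vectors; anything whose lower
        # bound exceeds it cannot be the nearest neighbor.
        best_upper = min(psum[i] + err for i in alive)
        alive = {i for i in alive if psum[i] - err <= best_upper}
        if len(alive) == 1:
            break
    exact = lambda i: sum(abs(a - b) for a, b in zip(query, refs[i]))
    return min(alive, key=exact)
```

Because the true nearest neighbor's lower bound can never exceed its own upper bound, it always survives elimination, so the low-precision passes only shrink the work, never the accuracy.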

Slide 17

[Chart: example kNN operation (Euclidean) — the search space (valid vectors, out of 128) shrinks over successive sort iterations until the kth NN is found.]

Iterative Search Space Reduction

  • Up to 5.2X higher throughput from early elimination
  • Up to 127X reduction for next nearest search space
Slide 18

kNN Accelerator Die Micrograph

[Die micrograph: 128 × 128-D kNN accelerator, 682µm × 488µm — I/O memory, I/O and clock, 64× 2-vector blocks (shared distance accumulators and controls per vector pair, 64 dimensions per side), distributed sort network.]

Process: 14nm Tri-gate CMOS
Nominal operation: 750mV, 338MHz, 25°C
Number of transistors: 12.2M
Accelerator area: 0.333mm²

  • H. Kaul, R. Krishnamurthy et al., ISSCC 2016
Slide 19

[Chart: query latency (cycles) vs. k nearest neighbors (k = 2–10).]

Performance Measurements

  • 21.5M queries/s and 16 cycles/query (Manhattan, k=1)
  • Average latency increase for each successive neighbor: 2 cycles (Manhattan) and 4 cycles (Euclidean)
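These figures are mutually consistent: at the 338MHz nominal clock, 16 cycles/query gives roughly 21M queries/s. A quick check, using Python as a calculator; the note on the small gap is our inference, not from the slide:

```python
freq_hz = 338e6         # nominal clock from the die summary
cycles_per_query = 16   # Manhattan, k=1
throughput_qps = freq_hz / cycles_per_query
# ~21.1M queries/s, in line with the reported 21.5M queries/s
# (the small gap suggests 16 cycles is a rounded figure)
```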

Slide 20

14nm CMOS, 338MHz, 750mV, 25°C

[Chart: total power (mW) and total energy/query vector (nJ) vs. k nearest neighbors (k = 2–10); callouts: 3.37nJ/query, 9.7TOPS/W, 73mW.]

Power Measurements

  • 73mW total power, 3.37nJ/query (Manhattan, k=1)
  • Average energy increase for each successive neighbor: 43pJ (Manhattan) and 87pJ (Euclidean)
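These numbers cross-check as well. The operation count per query below is our assumption (128 vectors × 128 dimensions × 2 ops, roughly one subtract/absolute and one accumulate per dimension), not stated on the slide:

```python
power_w = 73e-3                # total power (Manhattan, k=1)
throughput_qps = 21.5e6        # queries/s from the performance slide
energy_per_query_j = power_w / throughput_qps
# ~3.4e-9 J, matching the reported 3.37 nJ/query

ops_per_query = 128 * 128 * 2  # assumed: subtract + accumulate per dimension
tops_per_w = ops_per_query / energy_per_query_j / 1e12
# ~9.7 TOPS/W, matching the reported efficiency
```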

Slide 21

14nm CMOS, 25°C

[Chart: throughput (million query vectors/s) and total power (mW) vs. supply voltage (350–850mV), Manhattan and Euclidean, k=1.]

Supply Voltage Scaling Measurements

  • Robust NTV circuits enable 360mV-850mV operation
  • 26.4M queries/s, 114mW at 850mV (Manhattan, k=1)
  • 1.1M queries/s, 1.44mW at 360mV (Manhattan, k=1)
Slide 22

[Chart: energy/query (nJ) vs. supply voltage (350–850mV).]

Energy Scaling Measurements

  • Peak efficiency of 1.23nJ/query or 26.5TOPS/W at 390mV (near-threshold) → 2.73X improvement over nominal
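The reported improvement follows from the two energy points, and equally from the TOPS/W figures:

```python
nominal_nj = 3.37  # nJ/query at 750mV nominal
ntv_nj = 1.23      # nJ/query at 390mV near-threshold
improvement = nominal_nj / ntv_nj  # ~2.74, i.e. the reported ~2.73X
tops_ratio = 26.5 / 9.7            # same ratio from the TOPS/W figures
```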

Slide 23

“Extreme” energy efficiency

10-year goal: ~300X improvement in energy efficiency, equal to 20 pJ/FLOP at the system level (2W for 100 GigaFLOPS; 20MW for an ExaFLOPS system)

Slide 24

intelligence Inside

Slide 25


  • Iteratively search for kNN within 128x128-D vectors
  • Distant vectors eliminated in early iterations
  • Reconfigurable for Manhattan and Euclidean distance

kNN Accelerator Organization

Slide 26


  • Narrow single-cycle datapath for distance compute
  • Accumulate computed refinement to distance

Organization: Distance Compute

Slide 27


kNN Operation: Adaptive Precision

  • Data-dependent precision for each vector → reduces required compute and sort operations
  • Same nearest-neighbor result as a full-precision search
Slide 28

[Chart: average search space reduction (×) vs. k nearest neighbors (k = 2–10), Euclidean and Manhattan.]

Average Search Space Reduction

  • 10X-18X average reduction of starting search space for next nearest neighbor