"The Next Challenge: Energy Efficient Approach in ML Architecture", Professor Uri Weiser, Viterbi Faculty of Electrical Engineering - PowerPoint PPT Presentation



SLIDE 1

"The Next Challenge: Energy Efficient Approach in ML Architecture"

Uri Weiser UPC October 10th 2018

The presentation is based on work by: Gil Shomron, Daniel Raskin, Loren Jammal, Avi Baum, Yoav Etsion

Professor Uri Weiser Viterbi Faculty of Electrical Engineering The Technion Israel

July 1st 2019

Contributors to the research: Leeor Peled, Daniel Raskin, Gil Shomron, Leonid Yavits, Moran Shkolnik, Avi Baum

SLIDE 2

To Yale

Five years have passed since Yale@75.


OK but why do you have to drag us with you?

Interesting how you keep staying in the center…

SLIDE 3

Beauty comes shining through, not only when blooming

SLIDE 4

Agenda:

  • Technology environment
    • Process scaling is slowing down
    • Big Data
    • Funnel
    • Killer apps ➔ ML
  • Efficient ML basics
    • Energy:
      • Amdahl and Multi-Amdahl (dividing our limited resources effectively)
      • SMT – is this a biggy?
      • Pipeline – why?
    • Map applications to HW – the Data Flow concept
    • Prediction – no validation is necessary
  • Conclusions

SLIDE 5

Technology environment

Performance History

We (the architects) did an "OK-" job

[Figure: relative performance vs. feature size [um]: total impact ~2,000X, of which the process contributed ~100X and the uArch ~20X]*

*ACM Queue, April 6, 2012, Processors, Volume 10, Issue 4: "CPU DB: Recording Microprocessor History", Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, Mark Horowitz, Stanford University

SLIDE 6

Input: unstructured data

Big Data ➔ usage of DATA ➔ Extract, Transform, Load ➔ Read Once ➔ Non-Temporal Memory Access

Funnel: β = BW_out / BW_in
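The funnel ratio can be made concrete with a toy calculation (the bandwidth numbers below are illustrative assumptions, not from the talk):

```python
# Funnel: huge input bandwidth, small output bandwidth.
# beta = BW_out / BW_in; beta << 1 indicates a strong funnel.
def funnel_beta(bw_in: float, bw_out: float) -> float:
    """Ratio of output to input bandwidth."""
    return bw_out / bw_in

# Illustrative: a stage ingests 100 GB/s of raw data, emits 0.5 GB/s of results
beta = funnel_beta(100.0, 0.5)
print(f"beta = {beta:.3f}")  # beta = 0.005
```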

SLIDE 7

Killer applications*

*Applications you cannot effectively execute on current HW (Dr. Andy Grove)

  • ML is one!?
  • Funnel (in most of the cases)
    • Input: huge amount of data
    • Output: small amount
  • Many simple operations
SLIDE 8

Energy in "Data Flow" architecture

Now Read Once counts!

[Figure: instruction energy breakdown vs. "data flow" energy breakdown. Per instruction: I-Cache access 25pJ, control 45pJ, Reg. File access 6pJ, OP 0.5pJ, plus the data access; in a data-flow design, only the OP and the data access remain.]
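The figure's point is easy to check arithmetically: the per-instruction overheads dwarf the arithmetic itself. A quick sketch, mapping the slide's numbers to components by their order (that mapping is my assumption):

```python
# Per-instruction energy components (pJ), mapped from the slide by order:
I_CACHE_PJ = 25.0   # I-Cache access
CONTROL_PJ = 45.0   # control
REGFILE_PJ = 6.0    # register file access
OP_PJ      = 0.5    # the arithmetic operation itself

overhead_pj = I_CACHE_PJ + CONTROL_PJ + REGFILE_PJ   # paid on every instruction
total_pj = overhead_pj + OP_PJ
useful_fraction = OP_PJ / total_pj
print(f"overhead = {overhead_pj} pJ; the op is {useful_fraction:.1%} of the total")
# A data-flow machine that reads instructions once amortizes most of this overhead.
```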

SLIDE 9
Efficient ML I: Accelerator

  • Energy ➔ Performance
  • Map applications to HW ➔ graph mapping; data flow
    • Efficient mapping
    • Co-design of HW structure and a smart compiler in a specific application environment
  • Almost no flow control
  • Statistical results – no need to validate execution

SLIDE 10
Efficient ML II: Balanced design and energy reduction

  • Energy ➔ Performance
  • System vs. accelerators: it is Amdahl again!
  • Energy reduction
    • Reduction in computing (MAC ops)
      • Pruning
      • Prediction
    • Reduction in data access and movement
      • Pipeline
  • Efficient usage of the hardware resources
    • Multi-Amdahl (divide your limited resources effectively)
    • SMT

SLIDE 11

Efficient ML II: Reduction in Computing

[Figure: throughput vs. energy efficiency of ML accelerators (ISSCC, Feb 17th 2019 press announcements), with iso-efficiency lines at 0.1pJ/OP and 0.01pJ/OP; TOPS/W drops due to inefficiency (e.g. data movement, repeated DRAM accesses). Energy efficiency is the inverse of energy/OP.]
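The iso-efficiency lines follow from a unit conversion: TOPS/W is simply the reciprocal of pJ/OP. A one-line sanity check:

```python
# 1 W sustained at e pJ/OP gives (1 J/s) / (e * 1e-12 J/OP) = (1/e) TOPS.
def tops_per_watt(pj_per_op: float) -> float:
    """Convert energy per operation (pJ/OP) to efficiency (TOPS/W)."""
    return 1.0 / pj_per_op

print(tops_per_watt(0.1))   # 0.1 pJ/OP  -> 10 TOPS/W
print(tops_per_watt(0.01))  # 0.01 pJ/OP -> 100 TOPS/W
```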

SLIDE 12
Efficient ML II: Reduction in Computing (1)

  • Reduction in computing ➔ reduce the # of operations via:
    • Pruning – well-known techniques
    • Value (data) prediction
      • ML workloads are statistical ➔ no need to validate execution

G. Shomron, U. Weiser, "Spatial Correlation and Value Prediction in Convolutional Neural Networks", IEEE Computer Architecture Letters (CAL), January 2019
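To illustrate why spatial correlation enables value prediction, here is a toy sketch under my own simplifications (not the mechanism of the cited paper): neighboring activations in a CNN feature map tend to be similar, so some outputs can be copied from computed neighbors instead of paying for their MACs, and the network's statistical nature absorbs the small error with no validation or rollback step.

```python
import numpy as np

# Toy "feature map" with spatial correlation: each value is close to its neighbor.
rng = np.random.default_rng(0)
base = rng.random((8, 8))
fmap = (base + np.roll(base, 1, axis=1)) / 2.0   # smoothed -> correlated columns

# Predict odd columns from their computed even-column neighbors; skip those MACs.
pred = fmap.copy()
pred[:, 1::2] = fmap[:, 0::2]

skipped = pred[:, 1::2].size / fmap.size
mean_err = float(np.abs(pred - fmap).mean())
print(f"MACs skipped: {skipped:.0%}, mean prediction error: {mean_err:.3f}")
```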

SLIDE 13

Efficient ML II: Reduction in Data Accesses (2)

  • Reduction in data access and movement
  • Pipeline execution ➔ data stays on die

[Figure: without pipelining, a single MAC unit round-trips every result through memory (DRAM); with pipelining, MAC stages for Layer 1 … Layer n pass activations through on-die memory (SRAM), the input of each layer being the output of the previous one]
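A back-of-the-envelope model of the slide's point (the energy-per-byte numbers are my own ballpark assumptions, not from the talk): pipelining layers on die replaces DRAM round trips for inter-layer activations with SRAM accesses.

```python
# Assumed access energies (ballpark, for illustration only)
DRAM_PJ_PER_BYTE = 160.0   # off-die DRAM
SRAM_PJ_PER_BYTE = 1.25    # small on-die SRAM

def inter_layer_energy_pj(n_layers: int, act_bytes: int, on_die: bool) -> float:
    """Energy to hand activations between consecutive layers (one write + one read)."""
    per_byte = SRAM_PJ_PER_BYTE if on_die else DRAM_PJ_PER_BYTE
    return (n_layers - 1) * act_bytes * 2 * per_byte

no_pipe = inter_layer_energy_pj(4, 1024, on_die=False)   # every hand-off via DRAM
piped = inter_layer_energy_pj(4, 1024, on_die=True)      # data stays on die
print(f"on-die pipelining cuts inter-layer data energy by {no_pipe / piped:.0f}x")
```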

SLIDE 14
Efficient ML II: Efficient usage of HW

  • Multi-Amdahl* (divide your limited resources effectively), e.g. efficient resource division (e.g. SRAM)
  • SMT**
    • Resource needs are known ahead of time…

Optimization using Lagrange multipliers: minimize the target Σi ti·Fi(ai) under the constraint Σi ai = A, where Fi' is the derivative of the accelerator function; at the optimum, ti·Fi'(ai) = tj·Fj'(aj).

*T. Zidenberg, I. Keslassy, U. Weiser, "Optimal Resource Allocation with MultiAmdahl", IEEE Micro, August 2013
**G. Shomron, T. Horowitz, U. Weiser, "SMT-SA: Simultaneous Multithreading in Systolic Arrays", IEEE Computer Architecture Letters (CAL), July 2019; also Technion EE, Advanced Microarchitecture course exam (winter 2019)
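A minimal numeric sketch of the Lagrange condition (my own toy instance, with Fi(a) = 1/a, i.e. more resource means proportionally faster): at the optimal split, ti·Fi'(ai) is equal across units.

```python
import math

# Two accelerators share a fixed resource budget A (e.g. SRAM area).
# Stage i contributes t_i * F_i(a_i) to runtime; take F_i(a) = 1/a (toy model).
# Minimizing sum_i t_i/a_i subject to a1 + a2 = A via Lagrange multipliers
# yields a_i proportional to sqrt(t_i), satisfying t_i*F_i'(a_i) = t_j*F_j'(a_j).
A = 10.0
t = [0.64, 0.36]                       # time weights of the two stages
root = [math.sqrt(ti) for ti in t]
a = [A * r / sum(root) for r in root]  # closed-form optimum for F(a) = 1/a

# Verify the Lagrange condition with F'(a) = -1/a**2
marginals = [ti * (-1.0 / ai**2) for ti, ai in zip(t, a)]
print(a, marginals)
```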

SLIDE 15

Conclusions

  • Opportunities:
    • Map application to HW
    • Reduce energy per operation?
    • Reduce the # of operations
    • Reduce data movement and memory access
    • Efficient usage of HW
  • We're gonna have fun
    • Open field, lots of ideas, many researchers
    • Opportunities
    • New passionate energy in the community
    • Back to the "big impact" era…
SLIDE 16

Thank You
