Energy Estimation Methodology for Accelerator Designs Yannan Nellie - - PowerPoint PPT Presentation



slide-1
SLIDE 1

1

Accelergy: An Architecture-Level Energy Estimation Methodology for Accelerator Designs

Yannan Nellie Wu1 , Joel S. Emer1,2 , Vivienne Sze1

1 MIT 2 NVIDIA

slide-2
SLIDE 2

2

Accelergy Overview

  • An architecture-level energy estimator
  • Flexibly characterizes various basic building blocks of different technologies
  • Succinctly models diverse and complicated designs
  • Improves estimation accuracy via fine-grained classification of operations
  • Achieves 95% accuracy in evaluating a deep neural network (DNN) accelerator – Eyeriss [ISSCC 2016]

slide-3
SLIDE 3

3

Energy Consumption Concerns

Data- and computation-intensive applications are power hungry

  • Object detection: DNN accelerator
  • Database processing: database accelerator

We must quickly evaluate the energy efficiency of arbitrary potential designs in the large design space

slide-4
SLIDE 4

4

Energy Estimation and Design Exploration

[Figure: an example architecture description (Arch. Description): a global buffer (GLB) and a PE* containing a buffer and a MAC. Legend: component, abstract hierarchy]

*PE = processing element

slide-5
SLIDE 5

5

Energy Estimation and Design Exploration

  • Physical-Level Energy Estimator (Synopsys Prime Power, Cadence Joules)

Arch. Description -> RTL Model -> Physical Layout -> Energy

  – Arch. Description to RTL Model: develop the register transfer level (RTL) details
  – RTL Model to Physical Layout: synthesize the design, place standard cells, and route the wires

[Figure: physical layout of standard cells (NOR3, OR4, OR2) connected by wires (wire0, wire1)]

Requires physical layout of the design

slide-6
SLIDE 6

6

Energy Estimation and Design Exploration

  • Physical-Level Energy Estimator (Synopsys Prime Power, Cadence Joules)

Arch. Description -> RTL Model -> Physical Layout -> Fabricated Chip

[Figure: the design flow above, with Energy estimated at the physical level]

  • Requires physical layout of the design
  • Slow design space exploration

slide-7
SLIDE 7

7

Accelergy Overview

  • An architecture-level energy estimator
  • Flexibly characterizes various basic building blocks in the design
  • Succinctly models diverse and complicated designs
  • Improves estimation accuracy via fine-grained classification of operations
  • Achieves 95% accuracy in evaluating a deep neural network (DNN) accelerator – Eyeriss [ISSCC 2016]

slide-8
SLIDE 8

8

Energy Estimation and Design Exploration

  • Architecture-Level Energy Estimators

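Arch. Description -> RTL Model -> Physical Layout -> Fabricated Chip

[Figure: Energy is estimated directly from the Arch. Description: a global buffer (GLB) and a PE* containing a buffer and a MAC]

  • Only requires architecture-level design
  • Fast design space exploration

*PE = processing element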

slide-9
SLIDE 9

9

Existing Architecture-Level Energy Estimators

  • Design-Specific Accelerator Estimators: Aladdin [ISCA 2014], fixed-cost [Asilomar 2017]

Architecture Description (GLB, buffer, MAC, PE) -> Energy Estimator

Description with primitive components (basic building blocks)

slide-10
SLIDE 10

10

Existing Architecture-Level Energy Estimators

  • Design-Specific Accelerator Estimators: Aladdin [ISCA 2014], fixed-cost [Asilomar 2017]

Architecture Description (GLB, buffer, MAC, PE) -> Energy Estimator

Energy Reference Table (ERT)

Comp.    Action      Energy
GLB      access()    100pJ
buffer   access()    10pJ
MAC      compute()   5pJ

slide-11
SLIDE 11

11

Existing Architecture-Level Energy Estimators

  • Design-Specific Accelerator Estimators: Aladdin [ISCA 2014], fixed-cost [Asilomar 2017]

Architecture Description (GLB, buffer, MAC, PE) -> Energy Estimator

Energy Reference Table (ERT)

Comp.    Action      Energy
GLB      access()    100pJ
buffer   access()    10pJ
MAC      compute()   5pJ

Action Counts

Comp.    Action      Counts
GLB      access()    10
buffer   access()    800
MAC      compute()   400

slide-12
SLIDE 12

12

Existing Architecture-Level Energy Estimators

  • Design-Specific Accelerator Estimators: Aladdin [ISCA 2014], fixed-cost [Asilomar 2017]

Architecture Description (GLB, buffer, MAC, PE) -> Energy Estimator

ERT + Action Counts -> Energy Calculator -> Energy Estimations

Energy Reference Table (ERT)

Comp.    Action      Energy
GLB      access()    100pJ
buffer   access()    10pJ
MAC      compute()   5pJ

Action Counts

Comp.    Action      Counts
GLB      access()    10
buffer   access()    800
MAC      compute()   400

Energy Estimations

Name     Energy
GLB      1000pJ
buffer   8000pJ
MAC      2000pJ
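
The calculation on this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not code from Accelergy: the dictionary layout and the function name `estimate_energy` are assumptions, while the numbers are the ones in the ERT and action-count tables on this slide.

```python
# Energy reference table (ERT): energy per action in pJ, copied from the slide.
ERT_PJ = {
    ("GLB", "access"): 100,
    ("buffer", "access"): 10,
    ("MAC", "compute"): 5,
}

# Action counts, copied from the slide.
ACTION_COUNTS = {
    ("GLB", "access"): 10,
    ("buffer", "access"): 800,
    ("MAC", "compute"): 400,
}


def estimate_energy(ert, counts):
    """Return per-component energy in pJ: sum over actions of count * energy-per-action."""
    energy = {}
    for (component, action), count in counts.items():
        energy[component] = energy.get(component, 0) + count * ert[(component, action)]
    return energy


estimates = estimate_energy(ERT_PJ, ACTION_COUNTS)
for name, pj in estimates.items():
    print(f"{name}: {pj}pJ")
```

Each component's energy is the product of its action count and its per-action energy, which reproduces the Energy Estimations table.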

slide-13
SLIDE 13

13

Existing Architecture-Level Energy Estimators

  • Design-Specific Accelerator Estimators: Aladdin [ISCA 2014], fixed-cost [Asilomar 2017]

Architecture Description (GLB, buffer, MAC, PE) -> Energy Estimator

Energy Reference Table (ERT)

Comp.    Action      Energy
GLB      access()    100pJ
buffer   access()    10pJ
MAC      compute()   5pJ

Action Counts

Comp.    Action      Counts
GLB      access()    10
buffer   access()    800
MAC      compute()   400

[Figure: GLB’ marks a changed global buffer in the design and in the Energy Calculator]

Not generalizable to other designs

slide-14
SLIDE 14

14

Accelergy Overview

  • An architecture-level energy estimator
  • Flexibly characterizes various primitive components of different technologies
  • Succinctly models diverse and complicated designs
  • Improves estimation accuracy via fine-grained classification of operations
  • Achieves 95% accuracy in evaluating a deep neural network (DNN) accelerator – Eyeriss [ISSCC 2016]

slide-15
SLIDE 15

15

Accelergy: Flexibly Model Various Primitive Components

Architecture Description (GLB, buffer, MAC, PE) -> Accelergy

Accelergy:

  • ERT Generator
  • Primitive Component Library
  • Estimation plug-ins: CACTI Estimation Plug-in, 40nm Estimation Plug-in, …

Primitive Component Library entry: SRAM (the SRAM type has associated action “access”)

slide-16
SLIDE 16

16

Accelergy: Flexibly Model Various Primitive Components

Architecture Description (GLB, buffer, MAC, PE) -> Accelergy

Accelergy:

  • ERT Generator
  • Primitive Component Library
  • Estimation plug-ins: CACTI Estimation Plug-in, 40nm Estimation Plug-in, …

Primitive Component Library entry: SRAM (the SRAM type has associated action “access”)

ERT (in progress)

Comp.    Action      Energy
GLB      access()    100pJ

slide-17
SLIDE 17

17

Accelergy: Flexibly Model Various Primitive Components

Architecture Description (GLB, buffer, MAC, PE) -> Accelergy

Accelergy:

  • ERT Generator
  • Primitive Component Library
  • Estimation plug-ins: CACTI Estimation Plug-in, 40nm Estimation Plug-in, …

ERT

Comp.    Action      Energy
GLB      access()    100pJ
buffer   access()    10pJ
MAC      compute()   5pJ

slide-18
SLIDE 18

18

Accelergy: Flexibly Model Various Primitive Components

Architecture Description (GLB, buffer, MAC, PE) -> Accelergy

Accelergy:

  • ERT Generator
  • Primitive Component Library
  • Estimation plug-ins: CACTI Estimation Plug-in, 40nm Estimation Plug-in, …
  • Energy Calculator

ERT Generator -> ERT -> Energy Calculator

slide-19
SLIDE 19

19

Accelergy: Flexibly Model Various Primitive Components

Architecture Description (GLB, buffer, MAC, PE) -> Accelergy

Accelergy:

  • ERT Generator
  • Primitive Component Library
  • Estimation plug-ins: CACTI Estimation Plug-in, 40nm Estimation Plug-in, …
  • Energy Calculator

ERT + Action Counts -> Energy Calculator -> Energy Estimates

Action Counts

Comp.    Action      Counts
GLB      access()    10
buffer   access()    800
MAC      compute()   400

Energy Estimates

Name     Energy
GLB      1000pJ
buffer   8000pJ
MAC      2000pJ

slide-20
SLIDE 20

20

Accelergy: Flexibly Model Various Primitive Components

Use energy estimation plug-ins to characterize primitive components

Example plug-ins: CACTI Estimation Plug-in, 40nm Estimation Plug-in

  • Traditional open-source plug-ins*
  • Proprietary plug-ins
  • Emerging technology plug-ins: NVSIM [TCAD 2012]

*available at http://accelergy.mit.edu

slide-21
SLIDE 21

21

Accelergy: Flexibly Model Various Primitive Components

Use energy estimation plug-ins to characterize primitive components

Example plug-ins: CACTI Estimation Plug-in, 40nm Estimation Plug-in

  • Traditional open-source plug-ins*
  • Proprietary plug-ins
  • Emerging technology plug-ins: NVSIM [TCAD 2012]

*available at http://accelergy.mit.edu

Detailed plug-in interface in open-source repo
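
To make the plug-in idea concrete, here is a minimal sketch of how an ERT generator might query estimation plug-ins. Everything in it is hypothetical: the `EstimationPlugin` protocol, the class names, the `size_bytes` attribute, and the first-match selection rule are assumptions for illustration and are not the interface in the Accelergy repository. The energy values are the example ERT numbers from the earlier slides; the SRAM sizes are made up.

```python
from typing import Optional, Protocol


class EstimationPlugin(Protocol):
    """Hypothetical plug-in interface (NOT the interface in the Accelergy repo)."""

    def estimate(self, component_class: str, action: str, attrs: dict) -> Optional[float]:
        """Return energy per action in pJ, or None if the request is unsupported."""
        ...


class TableSramPlugin:
    """Toy plug-in: SRAM access energy looked up by capacity (sizes are made up)."""

    _ACCESS_PJ = {1024: 10.0, 131072: 100.0}

    def estimate(self, component_class, action, attrs):
        if component_class != "SRAM" or action != "access":
            return None
        return self._ACCESS_PJ.get(attrs.get("size_bytes"))


class TableMacPlugin:
    """Toy plug-in: fixed energy for a MAC compute action."""

    def estimate(self, component_class, action, attrs):
        if component_class == "MAC" and action == "compute":
            return 5.0
        return None


def generate_ert(components, plugins):
    """Build an ERT by asking each plug-in in turn; the first answer wins (an assumption)."""
    ert = {}
    for name, (component_class, actions, attrs) in components.items():
        for action in actions:
            for plugin in plugins:
                energy = plugin.estimate(component_class, action, attrs)
                if energy is not None:
                    ert[(name, action)] = energy
                    break
            else:
                raise LookupError(f"no plug-in supports {component_class}.{action}")
    return ert


components = {
    "GLB": ("SRAM", ["access"], {"size_bytes": 131072}),
    "buffer": ("SRAM", ["access"], {"size_bytes": 1024}),
    "MAC": ("MAC", ["compute"], {}),
}
ert = generate_ert(components, [TableSramPlugin(), TableMacPlugin()])
print(ert)
```

The point of the design is that the ERT generator never hard-codes energy values: swapping in a plug-in for a different technology changes the ERT without touching the architecture description.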

slide-22
SLIDE 22

22

Modeling Complicated Designs

  • Practical architecture designs involve many more details
    – Example: storage units with local address generators (AGs)

[Figure: AG_SRAM contains a buffer (SRAM) with data in and data out, and two counters: AG[0] generates the read address and AG[1] generates the write address]

  • AG_SRAM is an abstract hierarchy
  • Buffer is of SRAM type
  • AGs are of counter type

slide-23
SLIDE 23

23

Modeling Complicated Designs

  • Practical architecture designs involve many more details
    – Example: storage units with local address generators (AGs)

[Figure: the example design: GLB and a PE containing a buffer and a MAC]

Let’s construct a more practical design!

slide-24
SLIDE 24

24

Modeling Complicated Designs

  • Practical architecture designs involve many more details
    – Example: storage units with local address generators (AGs)

[Figure: the example design: GLB and a PE containing a buffer and a MAC]

Let’s construct a more practical design!

slide-25
SLIDE 25

25

Modeling Complicated Designs

[Figure: Architecture Description and Action Counts are inputs to Accelergy (ERT Generator, Primitive Component Library, Energy Calculator, and estimation plug-ins: CACTI Estimation Plug-in, 40nm Estimation Plug-in, …)]

Architecture Description

  • Architecture description is tedious
  • Hard to make modifications

Action Counts

Name          Action   Counts
PE[0].AG[0]   count    50
PE[0].AG[1]   count    50

  • Action counts are even more tedious
  • Small modifications require new action counts

slide-26
SLIDE 26

26

Existing Work - Modeling Complicated Designs

  • Existing work that aims to succinctly model complicated architectures
    – Wattch [ISCA 2000], McPAT [MICRO 2009]

CPU-Centric Architecture Model: ALU, L1 $, ROB, …, L2 $

Use a fixed set of compound components to represent the architecture

Compound components: components that can be decomposed into lower-level components

slide-27
SLIDE 27

27

Existing Work - Modeling Complicated Designs

  • Existing work that aims to succinctly model complicated architectures
    – Wattch [ISCA 2000], McPAT [MICRO 2009]

CPU-Centric Architecture Model: ALU, L1 $, ROB, …, L2 $

The fixed set of compound components is not sufficient to describe arbitrary accelerator designs

slide-28
SLIDE 28

28

Accelergy Overview

  • An architecture-level energy estimator
  • Flexibly characterizes various primitive components of different technologies
  • Succinctly models diverse and complicated designs
  • Improves estimation accuracy via fine-grained classification of operations
  • Achieves 95% accuracy in evaluating a deep neural network (DNN) accelerator – Eyeriss [ISSCC 2016]

slide-29
SLIDE 29

29

Accelergy: Succinctly Model Arbitrary Architecture

  • Architecture described with abstract hierarchies: AG_SRAM is an abstract hierarchy
  • Architecture described with compound components and primitive components: AG_SRAM is a user-defined compound component

[Figure: the AG_SRAM compound component contains a buffer (SRAM) and two counters: AG[0] (read address) and AG[1] (write address)]

[Figure: the Design contains GLB (of type AG_SRAM) and a PE with a buffer (of type AG_SRAM) and a MAC (of type mac)]
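
A small sketch of what a compound-component description buys: the design is written in terms of AG_SRAM, and a tool can expand it into primitive components. The data layout and the function `flatten` are assumptions made for illustration and do not reflect Accelergy's actual input format; the component names and types are those shown on this slide.

```python
# Compound component classes: sub-component name -> primitive (or compound) class.
COMPOUND_CLASSES = {
    "AG_SRAM": {"buffer": "SRAM", "AG[0]": "counter", "AG[1]": "counter"},
}

# The design, written succinctly in terms of compound and primitive classes.
DESIGN = {
    "GLB": "AG_SRAM",
    "PE.buffer": "AG_SRAM",
    "PE.MAC": "mac",
}


def flatten(design, compound_classes):
    """Recursively expand compound components into primitive components."""
    flat = {}
    for name, cls in design.items():
        if cls in compound_classes:
            children = {
                f"{name}.{child}": child_cls
                for child, child_cls in compound_classes[cls].items()
            }
            flat.update(flatten(children, compound_classes))
        else:
            flat[name] = cls
    return flat


flat = flatten(DESIGN, COMPOUND_CLASSES)
for name, cls in flat.items():
    print(f"{name}: {cls}")
```

Three lines of design description expand to seven primitive components, and a change to AG_SRAM is made once rather than at every instance.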

slide-30
SLIDE 30

30

Accelergy: Succinctly Model Arbitrary Action Counts

Action Counts

Name         Action    Counts
GLB.AG[0]    count()   50
GLB.AG[1]    count()   20
GLB.buffer   read()    50
GLB.buffer   write()   20
…

Tedious action counts in terms of primitive component actions

slide-31
SLIDE 31

31

Accelergy: Succinctly Model Arbitrary Action Counts

User-defined compound actions

AG_SRAM.read() consists of: AGs[0].count(), buffer.read()

[Figure: the design, with GLB (of type AG_SRAM) and buffer (of type AG_SRAM)]

slide-32
SLIDE 32

32

Accelergy: Succinctly Model Arbitrary Action Counts

User-defined compound actions

AG_SRAM.read() consists of: AGs[0].count(), buffer.read()

[Figure: the design, with GLB (of type AG_SRAM) and buffer (of type AG_SRAM)]

Action Counts

Name   Action    Counts
GLB    read()    50
GLB    write()   20

Succinct action counts with compound component actions
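
The relationship between compound and primitive action counts can be sketched as follows. The `read` decomposition is the one shown on the slide; the `write` decomposition (AG[1].count() plus buffer.write()) is an assumption inferred from the primitive counts on the previous slide. The function name and data layout are illustrative only.

```python
# Compound action definitions for the AG_SRAM compound component.
# "read" follows the slide; "write" is assumed by symmetry.
AG_SRAM_ACTIONS = {
    "read": [("AG[0]", "count"), ("buffer", "read")],
    "write": [("AG[1]", "count"), ("buffer", "write")],
}


def expand_counts(instance, compound_actions, compound_counts):
    """Expand compound action counts into primitive component action counts."""
    primitive = {}
    for action, count in compound_counts.items():
        for sub_component, sub_action in compound_actions[action]:
            key = (f"{instance}.{sub_component}", sub_action)
            primitive[key] = primitive.get(key, 0) + count
    return primitive


# Succinct counts from the slide: GLB read() 50, GLB write() 20.
primitive_counts = expand_counts("GLB", AG_SRAM_ACTIONS, {"read": 50, "write": 20})
for (name, action), count in primitive_counts.items():
    print(f"{name} {action}() {count}")
```

Two compound counts expand to the four primitive counts listed on the previous slide, so the user only has to supply the succinct form.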

slide-33
SLIDE 33

33

Accelergy: Succinctly Model Complex Designs

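[Figure: Architecture Description and Compound Component Description are inputs to Accelergy (ERT Generator, Primitive Component Library, Energy Calculator, and estimation plug-ins: CACTI Estimation Plug-in, 40nm Estimation Plug-in, …), which generates the ERT]

Compound Component Description example: MAC_FIFO, containing a MAC (of type mac) and a FIFO (of type FIFO)

ERT

Comp.          Actions        Energy
GLB            read(), …      120pJ, …
PE[0].buffer   read(), …      12pJ, …
PE[0].MAC      compute(), …   5pJ, …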

slide-34
SLIDE 34

34

Accelergy: Succinctly Model Complex Designs

[Figure: Architecture Description, Compound Component Description, and Action Counts are inputs to Accelergy (ERT Generator, Primitive Component Library, Energy Calculator, and estimation plug-ins: CACTI Estimation Plug-in, 40nm Estimation Plug-in, …), which produces the Energy Estimations]

Compound Component Description example: MAC_FIFO, containing a MAC (of type mac) and a FIFO (of type FIFO)

ERT

Comp.          Actions        Energy
GLB            read(), …      120pJ, …
PE[0].buffer   read(), …      12pJ, …
PE[0].MAC      compute(), …   5pJ, …

Action Counts

Name           Action   Counts
GLB            read()   50
PE[0].buffer   read()   60

slide-35
SLIDE 35

35

Additional Challenge: Inaccurate Modeling of Energy/Action

  • Existing architecture-level energy estimators only model coarse action types

Coarse-grained Actions

Component   Action      Energy
GLB         access()    100pJ
Buffer      access()    10pJ
ALU         compute()   5pJ

Fine-grained Actions: Energy-Per-Action of a Register File (normalized to idle)

Action                Energy
Random Read           1.8
Repeated Read         1.0
Random Write          4.7
Repeated Write        2.1
Constant Data Write   2.4

Energy per action differs by ~5x across fine-grained actions

Coarse-grained estimation reduces estimation accuracy

slide-36
SLIDE 36

36

Accelergy Overview

  • An architecture-level energy estimator
  • Flexibly characterizes various primitive components of different technologies
  • Succinctly models diverse and complicated designs
  • Improves estimation accuracy via fine-grained actions
  • Achieves 95% accuracy in evaluating a deep neural network (DNN) accelerator – Eyeriss [ISSCC 2016]

slide-37
SLIDE 37

37

Accelergy: Fine-grained Action Classification

  • Accurate estimation with a primitive component library

Defines the fine-grained actions for each primitive component

Fine-grained memory action types (~5x range)

Action                Energy
Random Read           1.8
Repeated Read         1.0
Random Write          4.7
Repeated Write        2.1
Constant Data Write   2.4

Fine-grained multiplier action types (~20x range)

Action        Energy
Random Mult   23.0
Reused Mult   16.8
Gated Mult    1.3

slide-38
SLIDE 38

38

Accelergy: Fine-grained Action Classification

  • Accurate estimation with a primitive component library*

Defines the fine-grained actions for each primitive component

Fine-grained memory action types (~5x range)

Action                Energy
Random Read           1.8
Repeated Read         1.0
Random Write          4.7
Repeated Write        2.1
Constant Data Write   2.4

Fine-grained multiplier action types (~20x range)

Action        Energy
Random Mult   23.0
Reused Mult   16.8
Gated Mult    1.3

*Detailed methodology for generating fine-grained action types in paper
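
A short sketch of why the classification matters, using the normalized register-file energies from this slide (values are paired with action names in the order they appear). The action mix is hypothetical, and the coarse-grained model here, which charges every read as a random read and every write as a random write, is an assumption chosen for illustration rather than a description of any particular estimator.

```python
# Normalized energy per action of a register file, from the slide.
FINE_ENERGY = {
    "random_read": 1.8,
    "repeated_read": 1.0,
    "random_write": 4.7,
    "repeated_write": 2.1,
    "constant_data_write": 2.4,
}

# Assumption: a coarse-grained estimator charges every read as a random read
# and every write as a random write.
COARSE_ENERGY = {
    "read": FINE_ENERGY["random_read"],
    "write": FINE_ENERGY["random_write"],
}


def coarse_type(action):
    """Map a fine-grained action name to its coarse type ("read" or "write")."""
    return "read" if action.endswith("read") else "write"


def total_energy(counts, fine_grained):
    """Sum count * energy-per-action using either the fine or the coarse table."""
    total = 0.0
    for action, count in counts.items():
        if fine_grained:
            per_action = FINE_ENERGY[action]
        else:
            per_action = COARSE_ENERGY[coarse_type(action)]
        total += count * per_action
    return total


# Hypothetical action mix, chosen only for illustration.
counts = {"random_read": 100, "repeated_read": 300, "random_write": 50, "repeated_write": 50}

fine = total_energy(counts, fine_grained=True)
coarse = total_energy(counts, fine_grained=False)
print(f"fine-grained: {fine:.1f}, coarse-grained: {coarse:.1f}")
```

For a workload dominated by repeated accesses, the coarse model overestimates noticeably; the size of the gap depends entirely on the action mix.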

slide-39
SLIDE 39

39

Accelergy Overview

  • An architecture-level energy estimator
  • Flexibly characterizes various primitive components of different technologies
  • Succinctly models diverse and complicated designs
  • Improves estimation accuracy via fine-grained actions
  • Achieves 95% accuracy in evaluating a deep neural network (DNN) accelerator – Eyeriss [ISSCC 2016]

slide-40
SLIDE 40

40

Energy Evaluations on Eyeriss

  • Experimental Setup:
    – Workload: AlexNet weights & ImageNet input feature maps
    – Ground Truth: Energy obtained from post-layout simulations

Eyeriss Architecture

  • GLBs: Weights GLB, Shared GLB
  • NoCs: WeightsNoC, IfmapNoC, PsumWrNoC, PsumRdNoC
  • PE array: 12x14 PEs
  • PE: weights_spad, ifmap_spad, psum_spad, MAC

Ifmap = input feature map; Psum = partial sum; PE = processing element; *_spad = *_scratchpad

slide-41
SLIDE 41

41

Energy Evaluations on Eyeriss

  • Experimental Setup:
    – Workload: AlexNet weights & ImageNet input feature maps
    – Ground Truth: Energy obtained from post-layout simulations

Eyeriss Architecture

  • GLBs: Weights GLB, Shared GLB
  • NoCs: WeightsNoC, IfmapNoC, PsumWrNoC, PsumRdNoC
  • PE array: 12x14 PEs
  • PE: weights_spad, ifmap_spad, psum_spad, MAC

Ifmap = input feature map; Psum = partial sum; PE = processing element; *_spad = *_scratchpad

Zero-gating optimization: if the ifmap data is 0

  • Gate on reading the weights data => gated-read
  • Gate on computing the MAC => gated-MAC
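
The zero-gating rule above can be turned into fine-grained action counts with a few lines of Python. This is a simplified sketch, not the Eyeriss dataflow: it assumes one weight read and one MAC per ifmap value, and the action names `gated_read` and `gated_mac` are illustrative spellings of the slide's gated-read and gated-MAC.

```python
from collections import Counter


def count_pe_actions(ifmap_values):
    """Classify weight reads and MACs as gated or normal, one pair per ifmap value."""
    counts = Counter()
    for value in ifmap_values:
        if value == 0:
            # Zero ifmap: the weight read and the MAC are both gated.
            counts[("weights_spad", "gated_read")] += 1
            counts[("MAC", "gated_mac")] += 1
        else:
            counts[("weights_spad", "read")] += 1
            counts[("MAC", "compute")] += 1
    return counts


# Hypothetical ifmap stream with 50% zeros.
pe_counts = count_pe_actions([3, 0, 0, 7, 1, 0])
for (component, action), count in sorted(pe_counts.items()):
    print(f"{component} {action}: {count}")
```

An estimator that only knows a single read() and compute() action would charge all six accesses at full cost, which is why sparsity-dependent energy is invisible to it.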
slide-42
SLIDE 42

42

Total Energy and Coarse Energy Breakdown

  • The total energy estimate is 95% accurate relative to the post-layout energy.
  • The estimated relative breakdown of the important units in the design is within 8% of the post-layout energy.

slide-43
SLIDE 43

44

PE Array Energy Breakdown

  • Comparisons with existing work: Aladdin and fixed-cost

[Figure: Energy Breakdown of PEs across the Array. x-axis: PEs that process data of different sparsity; y-axis: Energy Consumption (µJ), ticks from 0.10 to 0.26; series: ground truth, Accelergy, Aladdin, fixed-cost]
slide-44
SLIDE 44

45

PE Array Energy Breakdown

  • Comparisons with existing work: Aladdin and fixed-cost

[Figure: Energy Breakdown of PEs across the Array. x-axis: PEs that process data of different sparsity; y-axis: Energy Consumption (µJ), ticks from 0.10 to 0.26; series: ground truth, Accelergy, Aladdin, fixed-cost]

Not aware of the fine-grained actions related to zero-gating optimization

slide-45
SLIDE 45

46

PE Array Energy Breakdown

  • Comparisons with existing work: Aladdin and fixed-cost

[Figure: Energy Breakdown of PEs across the Array. x-axis: PEs that process data of different sparsity; y-axis: Energy Consumption (µJ), ticks from 0.10 to 0.26; series: ground truth, Accelergy, Aladdin, fixed-cost]

Inaccurate energy characterization of components

slide-46
SLIDE 46

47

PE Energy Breakdown

  • Comparisons with existing work: Aladdin and fixed-cost

[Figure: Energy Breakdown of components inside a PE. x-axis: ifmap_spad, psum_spad, weights_spad, MAC; y-axis: Energy Consumption (nJ), ticks from 10 to 100; series: ground truth, Accelergy, Aladdin, fixed-cost]

Zero-gating action type not reflected

slide-47
SLIDE 47

48

PE Energy Breakdown

  • Comparisons with existing work: Aladdin and fixed-cost

[Figure: Energy Breakdown of components inside a PE. x-axis: ifmap_spad, psum_spad, weights_spad, MAC; y-axis: Energy Consumption (nJ), ticks from 10 to 100; series: ground truth, Accelergy, Aladdin, fixed-cost]

All local scratchpads share the same energy reference table

slide-48
SLIDE 48

49

Conclusion

  • Accelergy is an architecture-level energy estimator that
    – Accelerates accelerator design space exploration
    – Provides flexibility to
        • Describe a diverse range of accelerator designs
        • Support estimation of different technologies, e.g., CMOS, RRAM, optical
    – Achieves high accuracy energy estimations
        • 95% accurate for the Eyeriss accelerator
  • Open-source code available at: http://accelergy.mit.edu

Acknowledgement: DARPA, Facebook, MIT Presidential Fellowship