LOW-POWER HIGH-PERFORMANCE ASYNCHRONOUS GENERAL PURPOSE ARMv7 PROCESSOR FOR MULTI-CORE APPLICATIONS

13th International Forum on Embedded MPSoC and Multicore, July 15-19, 2013, Otsu, Japan

Michel Laurence, Octasic Inc, Montréal, Canada


SLIDE 1

Octasic – Proprietary & Confidential | Use only pursuant to company instructions

LOW-POWER HIGH-PERFORMANCE ASYNCHRONOUS GENERAL PURPOSE ARMv7 PROCESSOR FOR MULTI-CORE APPLICATIONS

13th International Forum on Embedded MPSoC and Multicore July 15-19th 2013, Otsu, Japan

Michel Laurence michel.laurence@octasic.com

Octasic Inc, Montréal, Canada

SLIDE 2

FOREWORD

  • At MPSoC 2012 I presented a multi-core asynchronous DSP architecture:
    − High computing performance
    − Very high energy/power efficiency
  • We were wondering whether the same architecture, applied to a general purpose processor (like ARM), could deliver similar performance/power gains.
  • This presentation provides a summary of the results obtained so far.
SLIDE 3

CONTENTS

Perspective
Background
Processor Architecture and Operation
Performance Analysis
Conclusion

SLIDE 4

THE CHALLENGE OF MULTI-CORE “DARK SILICON”

Paper in COMMUNICATIONS OF THE ACM, Feb 2013:

Power Challenges May End the Multicore Era*

“As the number of cores increases, power constraints may prevent powering of all cores at their full speed, requiring a fraction of the cores to be powered off at all times. According to our models, the fraction of these chips that is “dark” may be as much as 50% within three process generations. The low utility of this “dark silicon” may prevent both scaling to higher core counts and ultimately the economic viability of continued silicon scaling.

. . . Without a breakthrough in process technology or microarchitecture, other directions are needed to continue the historical rate of performance improvement.”

*By Esmaeilzadeh, Blem, St. Amant, Sankaralingam, & Burger

Mike Muller, CTO of ARM, had made similar warnings in 2010

SLIDE 5

EXTENDING THE LIFE OF MULTI-CORE

  • Octasic has developed an asynchronous core micro-architecture which increases processing efficiency by a factor of 2-3x
  • This presentation explores whether applying the micro-architecture to a general purpose processor core would bring the same or similar benefits

SLIDE 6

CONTENTS

Overview
Background
  • Octasic
  • Why Asynchronous
  • ARM Core Project Objectives
Processor Architecture and Operation
Performance Analysis
Conclusion

SLIDE 7

BACKGROUND ON OCTASIC

Founded 15 years ago. Currently ~100 employees
Headquartered in Montreal, Canada
  • Subsidiary in Bangalore, India

Evolution:
  • 98/00 - Design ASICs for others
  • 2001 - Convert to fabless model
  • 2001-2003 - VoIP Support Products (Synchronous):
    − 2001 - Voice Packetization Engine / OCT8304
    − 2003 - Echo Cancellation Processor / OCT6100
  • 2004 - DSPs (Asynchronous) for Voice, Video, and Wireless Baseband
    − 2008 - First Generation / OCT1010
    − 2011 - Second Generation / OCT2224
    − …2014 - Third Generation / OCT3XXX

SLIDE 8

CONTENTS

Overview
Background
  • Octasic
  • Why Asynchronous
  • ARM Core Project Objectives
Processor Architecture and Operation
Performance Analysis
Conclusion

SLIDE 9

BASICS OF ASYNCHRONOUS TECHNOLOGY

With synchronous technology
  • The flow of information in a chip is controlled by a clock or a set of clocks
  • This is analogous to controlling traffic flow in a city with traffic lights

With asynchronous technology
  • The flow of information in a chip is controlled by feedback from one circuit to the other
  • This is analogous to controlling traffic flow in a city with round-abouts rather than traffic lights

SLIDE 10

BASICS OF ASYNCHRONOUS TECHNOLOGY

There are advantages and disadvantages with both methodologies:

With synchronous methodology (traffic lights):
  • the flow of traffic is centrally controlled and deterministic, hence more easily modelled; tools are easier to implement
  • but there are inefficiencies: cars can wait uselessly at a red light while there is no traffic in the perpendicular direction
  • … and clocks, unlike traffic lights, consume a LOT OF ENERGY.

With asynchronous methodology (round-abouts):
  • the flow of traffic is decentralized, thus less deterministic, with tools that are not as easy to develop and use
  • traffic can be more efficient: each car can proceed at its optimal speed, not at a fixed forced speed, and overall save fuel

SLIDE 11

CONTENTS

Overview
Background
  • Octasic
  • Why Asynchronous
  • ARM Core Project Objectives
Processor Architecture and Operation
Performance Analysis
Conclusion

SLIDE 12

ARM CORE PROJECT OBJECTIVES

Must be functionally identical with ARMv7
  • Object code compatible
  • Single thread performance parity
    − May improve performance with “tuned” compiler

Must be able to use off-the-shelf IDE tools
  • Debug interface compatibility
    − CoreSight compatibility

Must deliver 2-3x processing efficiency (energy)
  • Same performance using ½ – ⅓ the power
SLIDE 13

CONTENTS

Perspective
Background
Processor Architecture and Operation (simplified)
  • Octasic Async Principles
  • Architecture, Silicon, and ILP Implementation
  • Operation & Synchronization
  • Putting it all together
Performance Analysis
Conclusion

SLIDE 14

OCTASIC ASYNCHRONOUS TECHNOLOGY

Octasic Asynchronous Architecture is loosely characterized as: Single Rail Bundled Data (SRBD)

Traditionally with SRBD, each forward path stage is timed by handshake feedback (ACK) from the next stage signalling its availability

[Figure: three LATCH stages with enables (EN), each driven by a block labelled C; REQ signals run forward and ACK signals run back between stages]

This requires special silicon cells and specialized timing tools

SLIDE 15

OCTASIC ASYNCHRONOUS TECHNOLOGY

[Figure: two latch pipelines compared. Traditional: LATCH stages with EN, blocks labelled C, REQ forward and ACK feedback. Octasic: LATCH stages with EN, REQ forward, and a Rate Limit block per stage providing the “ACK”]

Octasic has modified the approach - no ACK but a rate limiter:
  • simplified circuit
  • no special silicon cell
  • standard design tools
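
One way to picture the rate limiter is as a minimum spacing between successive requests that a stage lets through, so the following stage never needs to send an explicit ACK. The Python sketch below is only an illustration of that idea under stated assumptions: the function name `rate_limited_launches`, the picosecond units, and the sample values are hypothetical, and the slides do not describe the actual circuit at this level.

```python
def rate_limited_launches(request_times_ps, min_interval_ps):
    """Return when each request is let through a rate-limited stage.

    Assumption: a request passes as soon as it arrives, unless the previous
    one passed less than min_interval_ps ago, in which case it is held back.
    """
    launches = []
    for req in request_times_ps:
        # The first request is never held; later ones respect the spacing.
        earliest = launches[-1] + min_interval_ps if launches else req
        launches.append(max(req, earliest))
    return launches


# Hypothetical request arrival times, in picoseconds.
print(rate_limited_launches([0, 100, 150, 900], min_interval_ps=250))
```

Requests that arrive close together are spread out to the minimum spacing, while a request that arrives after a long gap passes immediately.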
SLIDE 16

EXAMPLE: OCTASIC SIMPLIFIED EXECUTION UNIT

SLIDE 17

OCTASIC SIMPLIFIED EXECUTION UNIT

  • The operand state registers are asynchronously loaded
SLIDE 18

OCTASIC SIMPLIFIED EXECUTION UNIT

  • The operand state registers are asynchronously loaded
  • The instruction state register is asynchronously loaded
SLIDE 19

OCTASIC SIMPLIFIED EXECUTION UNIT

  • The operand state registers are asynchronously loaded
  • The instruction state register is asynchronously loaded
  • When ready (input registers loaded & output register released) a launch pulse is generated
SLIDE 20

OCTASIC SIMPLIFIED EXECUTION UNIT

  • The operand state registers are asynchronously loaded
  • The instruction state register is asynchronously loaded
  • When ready (input registers loaded & output register released) a launch pulse is generated
  • Delay chain timing is modulated according to instruction
SLIDE 21

OCTASIC SIMPLIFIED EXECUTION UNIT

  • The operand state registers are asynchronously loaded
  • The instruction state register is asynchronously loaded
  • When ready (input registers loaded & output register released) a launch pulse is generated
  • Delay chain timing is modulated according to instruction
  • Output state register is asynchronously loaded with result of instruction
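
The execution unit sequence above can be sketched as a small behavioural model in Python. This is a minimal sketch, not Octasic's implementation: the class name `ExecutionUnit`, the `DELAY_PS` and `OPS` tables, and the picosecond values are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical per-instruction delay-chain settings in picoseconds.
# The values are illustrative only; the slides do not give real timings.
DELAY_PS = {"add": 300, "sub": 300, "orr": 200, "mul": 900}

# Functional behaviour of each (assumed) instruction.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "orr": lambda a, b: a | b,
    "mul": lambda a, b: a * b,
}


@dataclass
class ExecutionUnit:
    operands: Optional[Tuple[int, int]] = None  # operand state registers
    instruction: Optional[str] = None           # instruction state register
    output: Optional[int] = None                # output state register
    output_released: bool = True                # previous result consumed?

    def load_operands(self, a: int, b: int) -> None:
        self.operands = (a, b)

    def load_instruction(self, op: str) -> None:
        self.instruction = op

    def ready(self) -> bool:
        # Launch condition: input registers loaded and output register released.
        return (self.operands is not None
                and self.instruction is not None
                and self.output_released)

    def launch(self) -> Optional[int]:
        """Generate a launch pulse if ready; return the delay used in ps."""
        if not self.ready():
            return None
        a, b = self.operands
        delay = DELAY_PS[self.instruction]      # delay modulated by instruction
        self.output = OPS[self.instruction](a, b)
        self.output_released = False            # result waits to be consumed
        self.operands, self.instruction = None, None
        return delay

    def release_output(self) -> Optional[int]:
        """Downstream logic consumes the result and frees the output register."""
        self.output_released = True
        return self.output


eu = ExecutionUnit()
eu.load_operands(6, 1)
eu.load_instruction("orr")
print(eu.launch(), eu.release_output())  # delay in ps, then the result
```

The point of the model is that nothing happens on a fixed clock edge: the unit fires only when its inputs are loaded and its output has been released, and the time it takes depends on the instruction.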
SLIDE 22

BENEFITS OF OCTASIC’S APPROACH

Uses only standard ASIC library elements
  • No custom cell
  • Ease of porting - from one silicon node to the next / from one vendor to another

Can use standard CAD tools and concepts
  • To facilitate sign-off
  • To facilitate staff conversion training

Uses standard ATPG tools and principles
  • Ensures manufacturability and reliability
SLIDE 23

CONTENTS

Perspective
Background
Processor Architecture and Operation (simplified)
  • Octasic Async Principles
  • Architecture, Silicon, and ILP Implementation
  • Operation & Synchronization
  • Putting it all together
Performance Analysis
Conclusion

SLIDE 24

SYNC VS ASYNC PROCESSOR IMPLEMENTATION

Conversion Sync => Async:
  • Each unit's functionality is maintained
  • Without pipelining, the async version will be slower
  • How can an async architecture implement ILP to get back performance?

[Figure: a synchronous pipeline (Fetch Unit stages F0-F2, Decode Unit stages D0-D2, Sync Execution Unit stages E0-E5, Store stage S0, covering Fetch, Decode, Reg Reads, Execute, Branch, Output Write, Store) next to an Async Execution Unit drawn as a Logic Cloud between two State registers. MEM load/store not shown.]

Instruction Level Parallelism (ILP) is key to increase computing performance:
  • In sync design, pipelining is used for ILP
SLIDE 25

ASYNC ILP IMPLEMENTATION (1)

SLIDE 26

ASYNC ILP IMPLEMENTATION (2)

To multiply the computing power or capacity of our processor, we could use multiple Execution Units (EUs) operating in parallel, much as is done in multi-processor and multi-core designs!

Now, how can we transparently weave together those EUs so they behave as one CPU?

SLIDE 27

ASYNC PROCESSOR ARCHITECTURE (2)

  • Starting with the 8 execution units …
SLIDE 28

ASYNC PROCESSOR ARCHITECTURE (3)

  • Adding a non-blocking combinatorial X-Bar switch to:
    − connect the execution units' data paths among themselves, and
    − connect them with external resources (register file, memory, etc.)
SLIDE 29

ASYNC PROCESSOR ARCHITECTURE (4)

  • Adding a CPU Register File to implement a load/store design:
SLIDE 30

ASYNC PROCESSOR ARCHITECTURE (5)

  • Adding a Data Memory Load/Store unit to be able to load/store memory data into/from the CPU (registers)
SLIDE 31

ASYNC PROCESSOR ARCHITECTURE (6)

  • Adding a Program Counter Control unit, including a branch predictor,
  • coupled with an Instruction Fetch & Decode Unit
  • to be able to load instructions into the execution units
SLIDE 32

ASYNC PROCESSOR ARCHITECTURE (7)

  • Adding L1 Memory accessible for:
    − Data, or
    − Code

A few given characteristics of the architecture to help increase performance and save power:
  • Loops
  • Register Shadowing
SLIDE 33

CONTENTS

Perspective
Background
Processor Architecture and Operation (simplified)
  • Octasic Async Principles
  • Architecture, Silicon, and ILP Implementation
  • Operation & Synchronization
  • Putting it all together
Performance Analysis
Conclusion

SLIDE 34

OPERATION AND SYNCHRONIZATION (1)

[Figure: execution units EU 1, EU 2, EU 3, EU 4, EU 5, EU 6, ... EU N arranged in a ring around COMMON RESOURCES: Regs + Mem., etc.]

This is an alternate simplified processor block diagram:
  • the execution units (EUs) are mapped in a ring-like fashion
  • the EUs have access to common resources:
    − Register File
    − Data Memory
    − Code Memory
    − X-Bar
    − PC Control Logic
  • a synchronization mechanism is required to arbitrate and avoid conflict in the access of the EUs to the common resources
SLIDE 35

OPERATION AND SYNCHRONIZATION (2)

The operation of a synchronous processor is generally centrally controlled. This asynchronous processor has a fully distributed control structure:
  • Control is exercised individually by each Execution Unit (EU)
  • Control tokens are passed asynchronously among the EUs in a ring fashion to synchronize accesses to common resources and avoid conflicts
  • In the simplified model discussed herein, six (6) tokens are used:
    − Instruction Fetch Token
    − Register Read Token
    − Launch Execution Token (X-Bar, Reg Ready)
    − No Mis-Prediction Token (PC & Write Commit)
    − Data Memory Token (Rd or Wr)
    − Register Write Token

[Figure: token logic built around a latch (D, G, Q) with TOKEN IN, TOKEN OUT, READY, RESOURCE_REQ, and ACCESS LOGIC]

SLIDE 36

OPERATION AND SYNCHRONIZATION (3)

Asynchronous control tokens are used to control and synchronize the overall operation of the processor.
  • Control tokens are passed from one EU to the next in a ring fashion.
  • When a token is owned by an EU, the EU can use it to request services (via Req pulses).
  • When a service request has been sent, a certain time has elapsed, and certain conditions are met, or when the EU does not need the token (resource), the token is passed to the next EU.
  • On start-up or after a flush (wrongly predicted branch), all tokens are assigned to the same EU.
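
The token-passing rules above can be sketched as a toy ownership model in Python. It is a simplification under stated assumptions: the class name `TokenRing`, the method names, and the token identifiers are hypothetical, and the timing and readiness conditions that gate a real hand-over are deliberately left out.

```python
# Token names follow the six tokens listed for the simplified model.
TOKENS = [
    "instruction_fetch", "register_read", "launch_execution",
    "no_misprediction", "data_memory", "register_write",
]


class TokenRing:
    """Toy model: each token is owned by exactly one EU at a time."""

    def __init__(self, num_eus: int, start_eu: int = 0) -> None:
        self.num_eus = num_eus
        # On start-up, all tokens are assigned to the same EU.
        self.owner = {token: start_eu for token in TOKENS}

    def use_and_pass(self, token: str, eu: int) -> bool:
        """EU uses the resource (or skips it) and passes the token on.

        Returns False if the EU does not currently own the token, in which
        case it must wait.
        """
        if self.owner[token] != eu:
            return False
        self.owner[token] = (eu + 1) % self.num_eus  # next EU in the ring
        return True

    def flush(self, restart_eu: int) -> None:
        # After a wrongly predicted branch, all tokens go to one EU again.
        for token in TOKENS:
            self.owner[token] = restart_eu


ring = TokenRing(num_eus=16)
ring.use_and_pass("instruction_fetch", 0)   # EU0 fetches, token moves to EU1
ring.use_and_pass("instruction_fetch", 1)   # EU1 fetches, token moves to EU2
print(ring.owner["instruction_fetch"], ring.owner["register_read"])
```

Because each token moves independently, the fetch token can run several EUs ahead of the register write token, which is what lets instructions overlap while each shared resource is still accessed in program order.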

SLIDE 37

PROCESSOR OPERATION – SIMPLIFIED ILP (1)

Assuming the operation of the Execution Units and resources (registers, memory, …) is somehow synchronized, here is the overlapped flow of instructions that would result in the processor, realizing the Instruction Level Parallelism (ILP) mechanism that boosts performance:

I1: add r4,r3,r9
I2: sub r7,r4,#0x01
I3: orr r4,r3,#0x01
I4: add r7,r7,r3,lsl r5
I5: ldr r9,r7,r2
I6: sub r7,r4,#0x01
I17: sub r2,r4,#0x47

[Figure: timing chart of instructions I1-I6 and I17 on EU0-EU5 (EU6-EU15 not detailed), with time axes in pico-seconds and in instruction cycles. Annotations: R3,R9; R4; R3; R7; R3,R5; R2; R4; and M R7. Legend: Fetch Instr., Decode Instr., Load Reg., Execute Instr., Write Output Reg., Memory access]
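
The overlap in the example is bounded by read-after-write dependencies between the instructions. The Python sketch below derives those dependencies for I1 to I6 from the operands shown on the slide. It is an illustrative analysis only: the `PROGRAM` encoding and the helper names `raw_dependencies` and `dependency_depth` are assumptions, and I17 is left out because the instructions between I6 and I17 are not shown.

```python
# (label, destination register, source registers) for I1-I6 as listed.
PROGRAM = [
    ("I1", "r4", ["r3", "r9"]),        # add r4,r3,r9
    ("I2", "r7", ["r4"]),              # sub r7,r4,#0x01
    ("I3", "r4", ["r3"]),              # orr r4,r3,#0x01
    ("I4", "r7", ["r7", "r3", "r5"]),  # add r7,r7,r3,lsl r5
    ("I5", "r9", ["r7", "r2"]),        # ldr r9,r7,r2
    ("I6", "r7", ["r4"]),              # sub r7,r4,#0x01
]


def raw_dependencies(program):
    """Map each instruction to the earlier instructions whose results it reads."""
    last_writer = {}   # register -> label of the most recent writer
    deps = {}
    for label, dest, sources in program:
        producers = []
        for reg in sources:
            writer = last_writer.get(reg)
            if writer is not None and writer not in producers:
                producers.append(writer)
        deps[label] = producers
        last_writer[dest] = label
    return deps


def dependency_depth(deps):
    """Length of the longest producer chain ending at each instruction."""
    depth = {}
    for label, producers in deps.items():   # dicts keep program order
        depth[label] = 1 + max((depth[p] for p in producers), default=-1)
    return depth


deps = raw_dependencies(PROGRAM)
for label, depth in dependency_depth(deps).items():
    print(label, "waits for", deps[label] or "nothing", "| depth", depth)
```

Instructions at the same depth, such as I1 and I3, have no data dependency on each other and are the ones that can overlap across execution units.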

SLIDE 38

PROCESSOR OPERATION ILP: REAL-WORLD EXAMPLE (2)

[Figure: timing chart of the program instruction flow against time (pico-seconds). Legend: Fetch Instr., Decode Instr., Load Reg., Execute Instr., Write Output Reg., Memory access]

Note: Dependencies are no different than in the case of synchronous pipelined processors. However, in the event of a pipeline stall, no dynamic power is consumed.

SLIDE 39

CONTENTS

Perspective
Background
Processor Architecture and Operation (simplified)
  • Octasic Async Principles
  • Architecture, Silicon, and ILP Implementation
  • Operation & Synchronization
  • Putting it all together
Performance Analysis
Conclusion

SLIDE 40

ARM BLOCK DIAGRAM

[Figure: ARM block diagram. Blocks: Data Registers (Write Port, Read Port); execution units XU0, XU1, ... XU14, XU15; Crossbar Mux; multiplier units MU0 (shared with XU8), MU1 (shared with XU9), MU6 (shared with XU6), MU7 (shared with XU7); Program Counter; Instruction FIFO; Flush & Jump Dest; Data Access FIFO; Data Memory; Instruction Memory; Feedback Engine; Branch Prediction; link To / From L2. Token color coding: Launch, Register Read, PC Update, Multiplication, Instruction Memory, Data Memory, Register Write. Synchronous Modules are marked separately.]

SLIDE 41

Typical ARM Execution Unit (EU) Implementation

[Figure: execution unit datapath. Blocks: Cntrl, Barrel Shift, AOX (logic), Adder Input Muxs, Adder, SAT, SAT 8/16/24, Other Instr; registers Rd, Rm, Rs, Rn, Rd; Clk; ALU]

Delay control value calculated on the fly based on instruction

SLIDE 42

ARM SILICON LAYOUT (TOP)

[Figure: top-level silicon layout. Labelled regions: dCache L1 Data Memory (32KB), iCache L1 Instruc. Memory (32KB), Data Access, Instruction Fetch, Branch Predictor, Registers, ARM Execution Core]

SLIDE 43

ARM SILICON LAYOUT (EXECUTION CORE ZOOM)

[Figure: execution core layout. Labelled regions: X-Bar, 8 EUs, 8 EUs, 4 Mul + Tokens, 4 Mul + Tokens]

SLIDE 44

CONTENTS

  • Perspective
  • Background
  • Processor Architecture and Operation
  • Performance Analysis
  • Conclusion
SLIDE 45

ARM LAYOUT SIZE (28NM)

Block                              Unit area (µm²)   Qty   Area (µm²)
EU & Tokens                                   8700    16       139200
MUL                                           7600     8        30400
XBAR                                         38500              38500
Branch Predictor                             12000              12000
iCache (32K) + Instruction Fetch            190000             190000
dCache (32K)                                172000             172000
Mem Man + Reg Files                          12600              12600
Total                                                          594700

*Note: These areas are extracted from the library which is based on drawn 32nm. Actual 28nm silicon size is smaller.
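
As a quick arithmetic check, the per-block totals can be summed and converted to mm². The Python sketch below uses only the listed area totals. The dictionary name `AREA_UM2` and the cache-share calculation are illustrative additions. As transcribed, the MUL row lists 7600 µm² and a quantity of 8 next to a total of 30400 µm²; the sketch uses the listed totals, which do sum to the stated 594700 µm².

```python
# Per-block total areas in µm², as listed in the layout table.
AREA_UM2 = {
    "EU & Tokens": 139200,
    "MUL": 30400,
    "XBAR": 38500,
    "Branch Predictor": 12000,
    "iCache (32K) + Instruction Fetch": 190000,
    "dCache (32K)": 172000,
    "Mem Man + Reg Files": 12600,
}

total_um2 = sum(AREA_UM2.values())
total_mm2 = total_um2 / 1_000_000          # 1 mm² = 1,000,000 µm²

# Area taken by the two L1 memories (the iCache row includes instruction fetch).
cache_um2 = (AREA_UM2["iCache (32K) + Instruction Fetch"]
             + AREA_UM2["dCache (32K)"])
cache_share = cache_um2 / total_um2

print(f"total: {total_um2} µm² = {total_mm2:.4f} mm²")
print(f"L1 memories (incl. instruction fetch): {cache_share:.1%} of the area")
```

The total of roughly 0.6 mm² is dominated by the two L1 memories, with the execution units, multipliers, and crossbar together taking about a third.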

SLIDE 46

ARM POWER BREAKDOWN (28NM)

Block                                     Power (mW)   % Dynamic
Central Mux + wires                            2.402        3.85
Register module                                1.654        2.34
Instruction fetch [synchronous]                4.867        25.0
ALU internals (including calculations)         4.343        6.96
Token modules                                  2.44         3.91
Everything else                                3.332        5.34
Instruction cache [synchronous]               16.07        25.75
Data cache (25% loads) [synchronous]          14.54        23.30
Branch Prediction [synchronous]               12.75        20.43
Total Dynamic Power                           62.4
Leakage                                       12.0
Total power                                   74.4

Executing Dhrystone @ 2,000 DMIPS
Typical, @25C
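
The breakdown can be cross-checked by recomputing totals and shares from the mW column. The Python sketch below does only that. The variable names are illustrative, and the DMIPS/mW figure is a derived number rather than one quoted on the slide. Recomputed this way, the shares for the register module and instruction fetch rows come out near 2.65% and 7.80%, which differs from the 2.34 and 25.0 listed, while the other rows match.

```python
# Dynamic power per block in mW, as listed in the power breakdown table.
DYNAMIC_MW = {
    "Central Mux + wires": 2.402,
    "Register module": 1.654,
    "Instruction fetch [synchronous]": 4.867,
    "ALU internals (including calculations)": 4.343,
    "Token modules": 2.44,
    "Everything else": 3.332,
    "Instruction cache [synchronous]": 16.07,
    "Data cache (25% loads) [synchronous]": 14.54,
    "Branch Prediction [synchronous]": 12.75,
}
LEAKAGE_MW = 12.0
DMIPS = 2000

total_dynamic = sum(DYNAMIC_MW.values())
total_power = total_dynamic + LEAKAGE_MW

# Share of dynamic power per block, recomputed from the mW column.
share_pct = {name: 100 * mw / total_dynamic for name, mw in DYNAMIC_MW.items()}

# Portion of dynamic power spent in the blocks marked [synchronous].
sync_mw = sum(mw for name, mw in DYNAMIC_MW.items() if "[synchronous]" in name)

print(f"dynamic: {total_dynamic:.1f} mW, total: {total_power:.1f} mW")
print(f"synchronous blocks: {100 * sync_mw / total_dynamic:.1f}% of dynamic power")
print(f"efficiency: {DMIPS / total_power:.1f} DMIPS/mW")
for name, pct in share_pct.items():
    print(f"{name}: {pct:.2f}%")
```

On these numbers, most of the dynamic power is spent in the blocks marked synchronous (caches, instruction fetch, branch prediction) rather than in the asynchronous execution core.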

SLIDE 47

ARM PRELIMINARY RESULTS SUMMARY

Simple ARM Core Compatible Implementation (~A8 equivalent)

  • Technology: 28nm LP STM
  • Performance: 2,000 DMIPS
  • Area (inc. 32KB L1 code and 32KB L1 data): ~0.6 mm2
  • Power: ~75mW @ 2,000DMIPS

This is believed to be good from an area perspective and very good from a power consumption perspective: ~½ the power consumption of an equivalent synchronous implementation

SLIDE 48

CONTENTS

  • Perspective
  • Background
  • Processor Architecture and Operation
  • Performance Analysis
  • Conclusion
SLIDE 49

POWER REDUCTION SOURCES

  • No balanced clock trees. Clocks are point to point and not skew sensitive. Therefore smaller gates, more HVT gates and shorter wires are used
  • No critical paths due to frequency constraints, therefore no need to optimize with large gates, and HVT gates can be used
  • Proximity of pipeline stages (each stage only connected to the previous or next). Therefore smaller gates, use of HVT gates and shorter wires
  • Clock edges are only generated when a resource is used. No wasted edges (ex: no power use during pipeline stall)
  • All of the above applies to: clocks, logic and data paths
  • Overall, this results in >> 80% HVT usage
SLIDE 50

ADDITIONAL POWER EFFICIENCIES

[Figure: diagram linking design features to their outcomes. Labels, in reading order: Temporal clock modulation; Clocks are point to point; <<< Fewer balanced clock trees; Smaller clock drivers; Very low dynamic power; Use of instruction level parallelism; Clock present only if resource is used; Optimised instruction time; Higher performance; Multiple instructions in parallel; <<< Fewer clock drivers; Shorter clock lines; Clocks local to execution units; Minimal clocks between execution units; Smaller gates; Lower leakage; Logic is small & local; Shorter wires; <<< Fewer gates; Silicon area efficient]

SLIDE 51

CONCLUSION

  • Processing power efficiency is becoming more and more important.
  • It will be imperative to push back the “dark silicon” phenomenon as silicon power improvements lag performance gains.
  • The application of a practical asynchronous processor micro-architecture can improve the power efficiency of commercial general purpose processors by ~2x or more, and can help with this problem.

SLIDE 52

CONCLUSION

Thank you!

Michel Laurence michel.laurence@octasic.com