Networks on chip: Evolution or Revolution?

Luca Benini, lbenini@deis.unibo.it, DEIS, Università di Bologna

MPSOC 2004


The evolution of SoC platforms

2 cores: Philips' Nexperia™ PNX8850, an SoC platform for high-end digital video (2001)

Scalable VLIW media processor (TriMedia™):

  • 100 to 300+ MHz
  • 32-bit or 64-bit

Nexperia™ system buses:

  • 32-128 bit

General-purpose scalable RISC processor (MIPS™):

  • 50 to 300+ MHz
  • 32-bit or 64-bit

Library of device IP blocks:

  • Image coprocessors
  • DSPs
  • UART
  • 1394
  • USB

[Block diagram: a TriMedia CPU (TM-xxxx, I$/D$) and a MIPS CPU (PRxxxx, I$/D$), each with its own PI bus of device IP blocks, share SDRAM through the MMI over the DVP memory bus.]


Running forward…

6 cores: Motorola's MSC8126, an SoC platform for 3G base stations (late 2003)

  • Four 350/400 MHz StarCore SC140 DSP extended cores
  • 16 ALUs: 5600/6400 MMACS
  • 1436 KB of internal SRAM and a multi-level memory hierarchy
  • Internal DMA controller supporting 16 TDM unidirectional channels
  • Two internal coprocessors (TCOP and VCOP) providing special-purpose processing capability in parallel with the core processors


What's happening in SoCs?

Technology: no slow-down in sight!
  Faster and smaller transistors… but slower wires, lower voltage, more noise!

Design complexity: from 2 to 10 to 100 cores!
  Design reuse is essential… but differentiation/innovation is key for winning in the market!

Performance and power: GOPS for mWs!
  Performance requirements keep going up… but power budgets don't!


…and on-chip communication?

Starting point: the "on-chip bus"

  • Advances in protocols
  • Advances in topologies

Revolutionary approaches

  • Networks on chip

Things are moving FAST… but is it evolution or revolution?


Outline

  • Introduction and motivation
  • On-chip networking
  • The HW-SW interface


On-chip bus architectures

Many alternatives:

  • Large semiconductor firms (e.g. IBM CoreConnect, STMicro STBus)
  • Core vendors (e.g. ARM AMBA)
  • Interconnect IP vendors (e.g. Sonics SiliconBackplane)

Same topology, different protocols


AMBA bus

  • AHB: high-speed, high-bandwidth multi-master system bus
  • APB: simplified, low-cost bus for general-purpose peripherals

[Diagram: CPUs, execution units, and memories attach to the AMBA high-speed bus through master and slave ports; a bridge connects the system bus to the peripheral bus for I/O.]


AHB bus architecture

  • Different signal groups use different wires
  • Dedicated wires only
  • NO bidirectional wires


AMBA basic transfer

Each transfer has an address phase and a data phase, for both writes and reads; overlapping the address phase of one transfer with the data phase of the previous one (pipelining) increases bus bandwidth.
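To make the bandwidth claim concrete, here is a toy cycle-count model (ours, not AMBA-accurate) showing how overlapping the two phases approaches one transfer per cycle:

```cpp
// Toy model: each transfer needs a one-cycle address phase (A) and a
// one-cycle data phase (D). Pipelining overlaps the A phase of transfer
// i+1 with the D phase of transfer i.
#include <cstdio>

int cycles_unpipelined(int n) { return 2 * n; }          // A then D, serially
int cycles_pipelined(int n)   { return n ? n + 1 : 0; }  // A of i+1 under D of i

int main() {
    const int n = 1000;  // number of transfers
    std::printf("unpipelined: %d cycles\n", cycles_unpipelined(n));  // 2000
    std::printf("pipelined:   %d cycles\n", cycles_pipelined(n));    // 1001
    // Throughput approaches one transfer per cycle as n grows.
    return 0;
}
```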


Bus arbitration

[Diagram: each master drives a dedicated request wire (HBREQ_M1…HBREQ_M3) to the central ARBITER; the address bus is shared.]

The arbitration protocol is defined, but the arbitration policy is not.
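The protocol/policy split can be pictured as a pluggable function behind a fixed request/grant interface; a minimal sketch with our own naming (the real HBUSREQ/HGRANT signalling is more involved):

```cpp
// The grant interface is fixed, but the policy choosing among pending
// requests is implementation-defined. Names (pick, kMasters) are ours.
#include <bitset>
#include <cstdio>

constexpr int kMasters = 4;

// Fixed-priority policy: lowest-numbered requesting master wins.
// A round-robin or TDMA policy could be swapped in without touching
// the request/grant signalling around it.
int pick(const std::bitset<kMasters>& req) {
    for (int m = 0; m < kMasters; ++m)
        if (req.test(m)) return m;
    return -1;  // no request pending
}

int main() {
    std::bitset<kMasters> hbusreq;   // one request wire per master
    hbusreq.set(1); hbusreq.set(3);
    std::printf("grant -> master %d\n", pick(hbusreq));  // master 1
    return 0;
}
```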


The price for arbitration

  • Time for arbitration
  • Time for handshaking
  • Wait states


Burst transfers

Burst transfers amortize arbitration cost (see the worked example below):

  • Grant bus control for a number of cycles
  • Help with DMA and block transfers
  • Help hiding arbitration latency

Requires safeguards against starvation: split and error responses.
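A back-of-the-envelope view of the amortization (the numbers are ours, for illustration): if arbitration plus handshaking costs A cycles per grant and a burst carries B data beats, bus efficiency is roughly B / (A + B). With A = 3, single transfers (B = 1) spend only 25% of cycles moving data, while 8-beat bursts reach 8/11, about 73%.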


Critical analysis: bottlenecks

Protocol

  • Lacks parallelism: in-order completion; no multiple outstanding transactions, so slave wait states cannot be hidden
  • High arbitration overhead (on single transfers)
  • Bus-centric vs. transaction-centric: initiators and targets are exposed to the bus architecture (e.g. the arbiter)

Topology

  • Scalability limitation of the shared-bus solution!


STBus

On-chip interconnect solution by ST

  • Levels 1-3: increasing complexity (and performance)

Features

  • Higher parallelism: 2 channels (M-S and S-M)
  • Multiple outstanding transactions with out-of-order completion
  • Supports deep pipelining
  • Supports packets (request and response) for multiple data transfers
  • Support for protection, caches, locking

Deployed in a number of large-scale SoCs at STM


STBus protocol (Type 3)

[Diagram: an initiator port and a target port connected by a request channel and a response channel; transactions decompose into request/response packets, packets into cells, cells into signals (transaction, packet, cell, and signal levels).]


STBus bottlenecks

Protocol is not fully transaction-centric

  • Cannot connect initiator directly to target (e.g. the initiator has no flow control on the response channel)

Packets are atomic on the interconnect

  • Cannot initiate or receive multiple packets at the same time
  • Large data transfers may starve other initiators


AMBA AXI

Latest (2003) evolution of AMBA: Advanced eXtensible Interface

Features

  • Fully transaction-centric: can connect M to S with nothing in between
  • Higher parallelism: multiple channels
  • Supports bus-based power management
  • Support for protection, caches, locking

Deployment: ??


Multi-channel M-S interface

[Diagram: master and slave connected by four parallel channels: address channel, write channel, read channel, and write response channel, each handshaked with VALID/READY/DATA signals.]

Channel handshaking: 4 parallel channels are available!
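A minimal model of the per-channel VALID/READY handshake (our naming and a hypothetical address; the actual AXI signals are named per channel, e.g. AWVALID/AWREADY): a beat transfers exactly in a cycle where both sides assert.

```cpp
// One channel of the handshake: source asserts valid when it has a beat,
// sink asserts ready when it can accept one; the beat fires on valid&&ready.
#include <cstdio>

struct Channel {
    bool valid = false;
    bool ready = false;
    unsigned data = 0;
};

bool fires(const Channel& c) { return c.valid && c.ready; }

int main() {
    Channel addr;
    // Cycle 1: master presents an address, slave not ready -> stall.
    addr.valid = true; addr.data = 0x80001000u; addr.ready = false;
    std::printf("cycle 1: %s\n", fires(addr) ? "beat" : "stall");
    // Cycle 2: slave becomes ready -> the beat completes.
    addr.ready = true;
    std::printf("cycle 2: %s\n", fires(addr) ? "beat" : "stall");
    return 0;
}
```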


Multiple outstanding transactions

A transaction implies activity on multiple channels

  • E.g. a read uses the address and read channels

Channels are fully decoupled in time

  • Each transaction is labeled when it is started (address channel)
  • Labels, not signals, are used to track transaction opening and closing (see the sketch after this list)
  • Out-of-order completion is supported (tracking logic in the master), but the master can request in-order delivery

Burst support

  • Single-address burst transactions (multiple data-channel slots)
  • Bursts are not atomic!

Atomicity is tricky

  • Exclusive access is better than locked access
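A sketch of label-based tracking as a master might implement it; the ID values, the map, and the function names are ours, since AXI carries IDs on the wires but does not prescribe the tracking logic:

```cpp
// Label (ID) based transaction tracking on the master side. Two reads can
// be outstanding at once, and the slave may return them in any order.
#include <cstdio>
#include <map>

struct Outstanding { unsigned addr; int beats_left; };

std::map<int, Outstanding> inflight;  // keyed by transaction ID

void issue(int id, unsigned addr, int beats) {      // address-channel beat
    inflight[id] = {addr, beats};                   // transaction opened
}

void data_beat(int id) {                            // read-channel beat
    auto it = inflight.find(id);
    if (it != inflight.end() && --it->second.beats_left == 0) {
        std::printf("transaction %d complete\n", id);
        inflight.erase(it);                         // transaction closed
    }
}

int main() {
    issue(7, 0x1000, 2);   // two reads outstanding at once: wait states on
    issue(3, 0x2000, 1);   // one can be hidden behind progress on the other
    data_beat(3);          // ID 3 completes out of order, before ID 7
    data_beat(7);
    data_beat(7);
    return 0;
}
```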


Scalability: execution time

Highly parallel benchmark (no slave bottlenecks)

[Two bar charts of relative execution time for AHB, AXI, STBus, and STBus (B) at 2, 4, 6, and 8 cores: one with 1 kB caches (low bus traffic), one with 256 B caches (high bus traffic).]


Scalability: protocol efficiency

[Two bar charts of interconnect usage efficiency and interconnect busy time for AHB, AXI, STBus, and STBus (B) at 2, 4, 6, and 8 cores.]

Under increasing contention, AXI and STBus show 80%+ efficiency, AHB < 50%.


Scalability: latency

[Chart of average and minimum read/write completion latency (cycles) for STBus (B) and AXI at 2, 4, 6, and 8 cores.]

STBus management has less arbitration latency overhead, especially noticeable in low-contention conditions.


Topology

A single shared bus is clearly non-scalable.

Evolutionary path: "patch" the bus topology. Two approaches:

  • Clustering & bridging
  • Multi-layer / multibus


Clustering and bridging

  • Heterogeneous architectures with asymmetric traffic: the cost of going across a bridge is HIGH
  • Bus clusters for bandwidth & latency reasons

Example: EASY SoCs for WLAN


AMBA Multi-layer AHB

  • Enables parallel access paths between multiple masters and slaves
  • Fully compatible with AHB wrappers

[Diagram: Master1 and Master2 reach multiple slaves through an interconnect matrix with per-layer AHB links (AHB1, AHB2) and slave ports.]


Multi-layer AHB implementation

The matrix is made of slave ports

  • No explicit arbitration of slaves
  • Variable latency in case of destination conflicts

[Diagram: Master1 and Master2 connect through decode and mux stages to Slave1…Slave4; arbitration happens per crossbar output.]


STBus crossbar & partial crossbar

[Diagram: a full crossbar (FC) and a partial crossbar (PC) configuration.]


Topology speedup (AMBA AHB)

[Bar chart of execution cycles (1M to 7M) for shared-bus, bridging, and multi-layer topologies, with and without semaphore synchronization.]

Independent tasks (matrix multiply), with & without semaphore synchronization, 8 processors (small caches).


Crossbar: critical analysis

  • No bandwidth reduction
  • Scales poorly: N² area and delay
  • A lot of wires and a lot of gates in a bus-based crossbar (e.g. Area_cell_4x4 / Area_cell_bus ≈ 2 for STBus)
  • No locality
  • Does not scale beyond 10x10!


NoCs

More radical solutions in the long term: Nostrum, HiNoC, Linköping SoCBUS, SPIN, star-connected on-chip networks, Aethereal, Proteo, Xpipes… (at least 15 groups)

[Diagram: CPUs, a DSP, and memories attached through network interfaces to switches connected by links.]


NoCs vs. busses

Packet-based

  • No distinction between address and data, only packets (but of many types)
  • Complete separation between end-to-end transactions and data-delivery protocols

Distributed vs. centralized

  • No global control bottleneck
  • Better link with placement and routing

Bandwidth scalability, of course!

(STBus and AXI already move part-way in this direction.)


The "power of NoCs"

Design methodology: clean separation at the session layer

  1. Define end-to-end transactions
  2. Define quality-of-service requirements
  3. Design transport, network, link, physical layers

Modularity at the HW level: only 2 building blocks (a minimal sketch follows)

  1. Network interface
  2. Switch (router)

Scalability is supported from the ground up (not as an afterthought).
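The two-building-block modularity can be rendered as two small types that any topology simply instantiates and wires together; all names here are ours, for illustration:

```cpp
// Building block 2: switch. Forwards each flit toward its destination
// using a per-switch routing table (here: dest -> output port).
// Building block 1: network interface. Wraps a node's transaction into
// flits (packetization is sketched in the Xpipes NI section below).
#include <cstdio>
#include <vector>

struct Flit { int dest; long long bits; };

struct Switch {
    std::vector<int> route;  // route[dest] = output port
    int forward(const Flit& f) const { return route[f.dest]; }
};

struct NetworkInterface {
    int node_id;
    Flit wrap(int dest, long long payload) const { return {dest, payload}; }
};

int main() {
    NetworkInterface ni{0};
    Switch sw{{0, 1, 1, 2}};   // dests 1,2 via port 1; dest 3 via port 2
    Flit f = ni.wrap(2, 0xBEEF);
    std::printf("flit to node %d leaves on port %d\n", f.dest, sw.forward(f));
    return 0;
}
```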

  • L. Benini MPSOC 2004

34

Building blocks: NI

Front end: standardized node interface @ session layer; the initiator vs. target distinction is blurred

  1. Supported transactions (e.g. QoSread…)
  2. Degree of parallelism
  3. Session protocol control flow & negotiation

Back end: NoC-specific (layers 1-4), manages the interface with the switches

  1. Physical channel interface
  2. Link-level protocol
  3. Network layer (packetization)
  4. Transport layer (routing)


Building blocks: switch

Router: receives and forwards packets

  • NOTE: packet-based does not mean datagram!

Level 3 or level 4 routing

  • No consensus, but generally L4 support is limited (e.g. simple routing)

[Diagram: input buffers & flow control feed a crossbar steered by an allocator/arbiter with QoS & routing logic; output buffers & flow control drive data ports with flow-control wires.]


Xpipes: context

Typical applications targeted by SoCs:

  • Complex
  • Highly heterogeneous
  • Communication intensive

Xpipes is a synthesizable, high-performance, heterogeneous NoC infrastructure.

[Diagram: a task graph (Task1…Task5) mapped onto processors (P1…P5) attached through NIs to switches and links.]


Heterogeneous topology

SoC component specialization leads to the integration of heterogeneous cores

  • Ex. MPEG4 decoder
  • Non-uniform block sizes
  • SDRAM is the communication bottleneck

On a homogeneous fabric:

  • Many neighboring cores do not communicate
  • Risk of under-utilizing many tiles and links
  • Risk of localized congestion


Network interface

Open Core Protocol (OCP) end-to-end communication protocol:

  • Pipelining
  • Independence of request/response phases

Network protocol: transaction-centric. A packet consists of a HEADER, a PAYLOAD, and a TAIL, and is segmented into FLITs.

The header includes:

  • Path across the network
  • Source
  • Destination
  • Command type
  • Burst ID (MBurst)
  • Packet identifier within message (ID-PACKET)
  • Local target IP address (IP_ADDR)

A packet-layout sketch follows.
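An illustrative packet layout following the header fields listed above; the field widths and the flit framing are our assumptions, not the actual Xpipes bit-level format:

```cpp
// Header fields as listed on the slide; widths are illustrative.
#include <cstdint>
#include <cstdio>
#include <vector>

struct Header {
    uint32_t path;        // path across the network (source routing)
    uint8_t  source;
    uint8_t  destination;
    uint8_t  command;     // command type (e.g. read/write)
    uint8_t  mburst;      // burst ID
    uint16_t id_packet;   // packet identifier within the message
    uint32_t ip_addr;     // local address within the target IP
};

enum class FlitType { Head, Body, Tail };
struct Flit { FlitType type; uint64_t bits; };

// Segment a packet (header + payload words) into flits: one head flit
// carrying routing info, body flits carrying payload, one tail flit.
std::vector<Flit> packetize(const Header& h, const std::vector<uint64_t>& payload) {
    std::vector<Flit> flits;
    flits.push_back({FlitType::Head, (uint64_t(h.path) << 32) | h.ip_addr});
    for (uint64_t w : payload) flits.push_back({FlitType::Body, w});
    flits.push_back({FlitType::Tail, 0});  // tail closes the packet
    return flits;
}

int main() {
    Header h{0x2A, 1, 4, 0 /*write*/, 0, 7, 0x100};
    auto flits = packetize(h, {0xDEADBEEF, 0xCAFE});
    std::printf("%zu flits\n", flits.size());  // head + 2 body + tail = 4
    return 0;
}
```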


Switch (s-Xpipes)

  • Plain latching of inputs
  • Buffering resources are on the output ports
  • FIFOs for performance (tunable area/speed tradeoff)
  • Circular buffers for ACK/NACK management (minimal size if directly attached to the downstream component, can be larger for pipelined links)
  • ACK/NACK flow control (a sketch follows this list)
  • 2-stage pipeline
  • Tuned for high clock speeds

[Diagram: crossbar steered by an allocator/arbiter.]
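A toy model of ACK/NACK flow control in the spirit described above: unacknowledged flits are kept in a buffer and replayed on a NACK. Buffer sizing and all names are ours:

```cpp
// Sender side of an ACK/NACK link: keep a copy of every in-flight flit
// until it is ACKed; on a NACK, replay the whole unacknowledged window
// (go-back-N style).
#include <cstdio>
#include <deque>

struct Sender {
    std::deque<int> unacked;       // stand-in for the circular buffer
    void send(int flit) {
        unacked.push_back(flit);   // keep a copy until it is ACKed
        std::printf("tx flit %d\n", flit);
    }
    void on_ack() { unacked.pop_front(); }  // safe to drop the copy
    void on_nack() {                        // corrupted/dropped downstream
        for (int f : unacked) std::printf("retx flit %d\n", f);
    }
};

int main() {
    Sender s;
    s.send(1); s.send(2); s.send(3);
    s.on_ack();     // flit 1 delivered
    s.on_nack();    // flits 2 and 3 must be replayed
    return 0;
}
```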

Example: MPEG4 decoder

[Figure: core graph representation with annotated average communication requirements.]


NoC floorplans

[Floorplans: general-purpose mesh; application-specific NoC1 (centralized); application-specific NoC2 (distributed).]


Performance, area and power

  • Relative link utilization (custom NoC / mesh NoC): 1.5, 1.55
  • Relative area (mesh NoC / custom NoC): 1.52, 1.85
  • Relative power (mesh NoC / custom NoC): 1.03, 1.22

Less latency and better scalability for the custom NoCs.


NoC synthesis flow

In cooperation with Stanford University: SUNMAP

[Flow: the application plus a topology library feed topology selection and mapping onto topologies, with routing-function co-design, using area and power libraries and a floorplanner; the xpipes compiler then generates a SystemC design from the xpipes library for simulation.]


Outline

  • Introduction and motivation
  • On-chip networking
  • The HW-SW interface: session layer and above


Mapping applications

[Diagram: application task graphs (T1…T3) mapped onto an abstract parallel architecture: PEs, memories (M), and I/O connected by a NoC.]

Communication abstractions:

  • Shared memory (UMA vs. NUMA)
  • Message passing

What hardware support for these communication abstractions?


MPARM architecture

[Diagram: four ARM cores, each with a private memory (PRI MEM 1-4), plus shared memory, semaphores, and an interrupt controller on the interconnect.]

Interconnect: STBus or AMBA or Xpipes.


Basic architecture

[Diagram: processor tiles #1…#N, each with an ARM core, MMU, and I/D cache, attached to the interconnect together with shared memory and semaphores.]


Support for message passing

[Diagram: as the basic architecture, but each processor tile adds a scratch-pad memory and local semaphores next to the MMU and I/D cache; shared memory remains on the interconnect.]

A message-passing sketch over this hardware follows.
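A hedged sketch of send/receive over the tile-local scratch-pad and semaphore. On the real platform these would be memory-mapped regions; here plain variables stand in so the sketch runs on a host, and all names are ours:

```cpp
// Producer writes the payload straight into the consumer's scratch-pad
// (no shared-memory round trip), then raises the semaphore; the consumer
// spins on its local semaphore and reads at local-SRAM speed.
#include <cstddef>
#include <cstdio>
#include <cstring>

unsigned char scratchpad[256];   // consumer tile's scratch-pad (stand-in)
volatile int  msg_ready = 0;     // consumer tile's semaphore (stand-in)

void send(const char* msg, std::size_t len) {
    std::memcpy(scratchpad, msg, len);
    msg_ready = 1;               // signal: message available
}

void receive(char* buf, std::size_t len) {
    while (!msg_ready) { /* spin on the local semaphore */ }
    std::memcpy(buf, scratchpad, len);
    msg_ready = 0;               // free the scratch-pad slot
}

int main() {
    char buf[16];
    send("hello", 6);
    receive(buf, 6);
    std::printf("%s\n", buf);
    return 0;
}
```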


HW support for MP: results

[Two bar charts (8 cores) of relative execution time for shared, bridging, and multi-layer interconnects: matrix pipeline on the basic architecture (up to ~170%) vs. with message-passing support (down to ~20%).]

Send+receive cost: 35 Kcycles (basic architecture) vs. 4 Kcycles (MP support). Configuration: 4 processors, shared bus.


Support for UMA

[Diagram: in each processor tile, a snoop device sits beside the ARM core's cache and observes invalidate/update addresses and data on the bus*.]

*cannot be a generic interconnect!

A sketch of the snoop policies follows.
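A sketch of the two write-through snoop policies compared in the next slides (WTI = write-through invalidate, WTU = write-through update); the cache model is ours, reduced to an address-to-value map:

```cpp
// Snoop device reacting to writes observed on the bus from other masters:
// WTI evicts the local copy, WTU patches it in place.
#include <cstdint>
#include <unordered_map>

enum class Policy { WTI, WTU };

struct SnoopDevice {
    Policy policy;
    std::unordered_map<uint32_t, uint32_t> cache;  // addr -> cached value

    void on_bus_write(uint32_t addr, uint32_t data) {
        auto it = cache.find(addr);
        if (it == cache.end()) return;                // not cached: ignore
        if (policy == Policy::WTI) cache.erase(it);   // invalidate the line
        else                       it->second = data; // update it in place
    }
};

int main() {
    SnoopDevice s{Policy::WTU, {{0x100, 1}}};
    s.on_bus_write(0x100, 2);  // WTU: local copy becomes 2; WTI would evict
    return s.cache.at(0x100) == 2 ? 0 : 1;
}
```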


Readers-writers: varying cache size

[Four charts of relative cycles, energy, energy-delay product, and power for SW, WTI, and WTU coherence at cache sizes of 512 B to 4096 B.]


Readers-writers: varying buffer size

[Four charts of relative cycles, energy, energy-delay product, and power for SW, WTI, and WTU coherence at buffer sizes of 16, 256, and 1024.]


Conclusions

Evolutionary shift from bus-based interconnect to NoCs

  • Well underway (there's no stopping now)
  • Methodology/tooling is the main issue

Platform challenges

  • Programming abstraction
  • HW/SW tradeoffs in session-layer support