slide-1
SLIDE 1

Al Faruque http://ces.univ-karlsruhe.de/

System-on-Chip Communication Architecture

Dr.-Ing. Mohammad Abdullah Al Faruque

Chair for Embedded Systems (CES)

Karlsruhe Institute of Technology

slide-2
SLIDE 2

Mohammad Abdullah Al Faruque

Chair for Embedded Systems

WS09/10

Columns of Embedded System Design

1. Embedded processor architectures

  • General-purpose computer architectures are rarely appropriate for embedded systems (ES): they offer a fair compromise among many constraints, but they cannot be adapted to the specific needs of an ES

2. Electronic System-Level design (ESL) methodologies

  • Rising complexity of systems-on-chip (SoC) requires design methodologies at a higher level of abstraction
  • The large design space has to be explored efficiently

3. Embedded software

  • Software engineering: MDA (Model-Driven Architecture), ...

4. Technology of integrated circuits

  • New technologies offer new possibilities for ES design
  • Example: reconfigurable computing due to advances in FPGA technology

slide-3
SLIDE 3

Architectures: General-purpose processor, ASIP, ASIC

Axes: flexibility, 1/time-to-market, … versus “efficiency” ($/MIPS, mW/MHz, MIPS/area, …)

“Hardware solution”: ASIC

  • non-programmable
  • highly specialized

ASIP (extensible processor)

  • instruction extension/definition
  • parameterization
  • inclusion/non-inclusion of functionality/devices

“Software solution”: general-purpose processor

slide-4
SLIDE 4

Trends: Crisis of Complexity

[Figure: Design Productivity Gap. Available gates versus used gates, in millions of gates, 1990 to 2006. Source: Gartner/Dataquest]

  • Prediction for the case that no ESL tools are used
  • However: the red curve will apply and lead to SoCs with 100s to 1000s of PEs per chip

slide-5
SLIDE 5

Next Generation Handheld Devices

  • The download and the TV continue when an incoming call is accepted
  • Games, sensor nodes, navigation, etc.

Huge computational power and application concurrency

  • Computational power -> MPSoC
  • Varying requirements -> exploiting application parallelism

[Figure: handheld device running three applications at once: file download, TV channel, and phone with an incoming video call]

slide-6
SLIDE 6

Maya (Rabaey’00)

slide-7
SLIDE 7

Maya (Rabaey’00)

slide-8
SLIDE 8

Maya (Rabaey’00)

slide-9
SLIDE 9

Maya (Rabaey’00)

slide-10
SLIDE 10

Maya (Rabaey’00)

slide-11
SLIDE 11

The Cell Processor

slide-12
SLIDE 12

The Cell Processor

slide-13
SLIDE 13

The Cell Processor

• Fclock > 4 GHz
• Memory bandwidth: 25.6 GBytes per second
• I/O bandwidth: 76.8 GBytes per second
• Performance:
  • 256 GFLOPS (single precision at 4 GHz)
  • 256 GOPS (integer at 4 GHz)
  • 25 GFLOPS (double precision at 4 GHz)
• 235 square mm
• 235 million transistors
• Power consumption estimated at 60-80 W @ 4 GHz

slide-14
SLIDE 14

Cell’s Element Interconnect Bus

• From the trenches: D. Krolak, IBM

  • “Well, in the beginning, early in the development process, several people were pushing for a crossbar switch, and the way the bus is architected, you could actually pull out the EIB and put in a crossbar switch if you were willing to devote more silicon space on the chip to wiring. We had to find a balance between connectivity and area, and there just wasn't enough room to put a full crossbar switch in. So we came up with this ring structure which we think is very interesting. It fits within the area constraints and still has very impressive bandwidth.”

slide-15
SLIDE 15

Cell’s Element Interconnect Bus

• 4 rings (2 clockwise + 2 counter-clockwise)
• No token rings; arbitration is still request/grant

slide-16
SLIDE 16

Very long wires

[Figure: a wire from A to B across the chip. Year 2005: 1 ns clock period (1 GHz). Year 2010: 0.1 ns clock period (10 GHz).]

slide-17
SLIDE 17

Bus pros and cons

Cons:

  • Every unit attached adds parasitic capacitance; electrical performance therefore degrades with growth.
  • Bus timing is difficult in a deep-submicron process.
  • Bus arbiter delay grows with the number of masters. The arbiter is also instance-specific.
  • Bandwidth is limited and shared by all units attached.

Pros:

  • Bus latency is zero once the arbiter has granted control.
  • The silicon cost of a bus is near zero.
  • Any bus is almost directly compatible with most available IPs, including software running on CPUs.
  • The concepts are simple and well understood.

slide-18
SLIDE 18

What are NoCs?

• According to Wikipedia:

  • “Network-on-a-chip (NoC) is a new paradigm for System-on-Chip (SoC) design. NoC based-systems accommodate multiple asynchronous clocking that many of today's complex SoC designs use. The NoC solution brings a networking method to on-chip communications and claims roughly a threefold performance increase over conventional bus systems.”

• Imprecise…

slide-19
SLIDE 19

Main differences between NoC and bus

[Figure: a bus-based system (PE1 to PE4 on one shared bus) next to a NoC-based system (PE1 to PE4 connected through switches S)]

Bus-based system:

  • A single bus does not provide concurrent transmissions
  • Large bus lengths are prohibitive, since geometrically large SoCs plus high frequencies (~10 GHz by end of decade) lead to unmanageable clock skew

NoC-based system:

  • Packets are transmitted, not words
  • Transactions can be executed in parallel
  • Routers in the network provide decoupling -> no clock skew concerns
  • Routing of wires is more structured through tiling -> less complex routing

Buses do not scale!

slide-20
SLIDE 20

Power consumption and NoCs

• Power is not an issue for large-scale networks
• For NoCs, it is:

  • An increased number of PEs per chip leads to increased wiring
  • Communication may also increase, since NoCs open new application areas (e.g. embedded multimedia)
  • In principle, a network-based system is still more power efficient than a bus-based system
  • -> because a bus-based system broadcasts the information to any possible recipient, whereas in a NoC-based system the information (packet) is only sent to the actual recipients
  • Still, due to the trend towards communication-centric design styles, the NoC may be a major power consumer of an SoC

slide-21
SLIDE 21

Overview

• Motivation
• NoC design challenges
• State of the art: Xpipes NoC architecture
• Quality-of-Service (QoS) architectures

slide-22
SLIDE 22

NoC: Good news

• Only point-to-point one-way wires are used, for all network sizes.
• Aggregated bandwidth scales with the network size.
• Routing decisions are distributed and the same router is re-instantiated, for all network sizes.
• NoCs increase wire utilization (as opposed to ad-hoc p2p wires).
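
The claim that aggregated bandwidth scales with the network size can be checked with a little arithmetic. The sketch below is only an illustration under stated assumptions: a regular n x n mesh (the slide does not fix a topology), one one-way link in each direction between neighboring routers, and the same bandwidth on every link. The function names are hypothetical.

```python
def mesh_links(n: int) -> int:
    """Number of one-way router-to-router links in an n x n mesh (assumed topology)."""
    # n rows with (n - 1) neighbor pairs each, the same for the n columns,
    # and two one-way links per neighbor pair.
    return 2 * 2 * n * (n - 1)


def aggregate_bandwidth(n: int, link_bw: float) -> float:
    """Upper bound on total bandwidth if every link carries traffic at once."""
    return mesh_links(n) * link_bw


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, mesh_links(n), aggregate_bandwidth(n, 1.0))
```

A shared bus, by contrast, offers one link's worth of bandwidth regardless of how many units are attached.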

slide-23
SLIDE 23

There’s no free lunch…

• Internal network contention causes (often unpredictable) latency.
• The network has a significant silicon area.
• Bus-oriented IPs need smart wrappers.
• Software needs clean synchronization in multiprocessor systems.
• System designers need reeducation for new concepts.

slide-24
SLIDE 24

The Communication Task Graph

[Figure: MPEG core graph. Nodes: vu, med cpu, rast, sdram, sram1, sram2, idct etc., adsp, up samp, risc, au, bab. Edge labels: 190, 0.5, 910, 0.5, 60, 40, 600, 40, 250, 500, 173, 670, 32]

slide-25
SLIDE 25

Application Pull

slide-26
SLIDE 26

Open Research Problems

• Communication architecture:

  • Topology
  • Channel width
  • Buffer size
  • Floor planning

• Communication paradigm:

  • Mapping
  • Scheduling
  • Switching
  • Routing

• Testing:

  • Prototyping
  • Benchmarking
slide-27
SLIDE 27

Parameters to be Configured for NoC

Networks-on-Chip

  • Parameters related to router architecture: topology customization, buffer size customization, floorplanning customization, bandwidth customization
  • Parameters related to application mapping: task-to-IP mapping, task scheduling
  • Parameters related to communication paradigm: routing algorithm, selection of switching scheme

slide-28
SLIDE 28

The Topology Problem

Regular Topology vs. Custom Topology

Regular topology:

  • If the sizes of the cores vary widely, area is wasted
  • Communication requirements of the cores vary widely

Custom topology:

  • High design effort
  • Hard to predict electrical behavior (uneven wire length)

It's a tradeoff between performance and design costs

slide-29
SLIDE 29

The Channel width Problem

  • Tradeoff between throughput and the area used for the communication infrastructure
  • Current prototypes use 32 bits
  • Other proposals use 256 bits

Higher channel width:

  • lower latency
  • more area for wiring
  • bigger input buffers needed
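
The tradeoff above can be made concrete with a small calculation. This is a minimal sketch under assumptions that are not in the slides: a packet is cut into flits of exactly one channel width, one flit crosses a link per cycle, and an input buffer holds a fixed number of flits. The packet size and buffer depth are hypothetical example values.

```python
import math


def flits_per_packet(packet_bits: int, channel_width_bits: int) -> int:
    """Cycles needed to push one packet through a link, one flit per cycle."""
    return math.ceil(packet_bits / channel_width_bits)


def input_buffer_bits(channel_width_bits: int, depth_flits: int) -> int:
    """Storage of one input buffer that holds `depth_flits` flits."""
    return channel_width_bits * depth_flits


if __name__ == "__main__":
    packet_bits = 512  # hypothetical packet size
    depth = 4          # hypothetical buffer depth in flits
    for width in (32, 256):
        print(width,
              flits_per_packet(packet_bits, width),
              input_buffer_bits(width, depth))
```

With these example values, the 256-bit channel needs an eighth of the cycles per packet but eight times the buffer storage and wiring per link.
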
slide-30
SLIDE 30

Area: Different Channel Widths

[Figure: area in slices (axis ticks 500 to 2500) for channel widths 16, 32, 64, and 128]

slide-31
SLIDE 31

The Buffer size Problem

  • The buffer size has a high influence on the latency and on the total area needed for the routers
  • There are algorithms to optimize the buffer size under special assumptions
  • Influence of the buffer size in a NoC: increasing the input buffers from 2 to 3 words increases the router area by 30% or more

slide-32
SLIDE 32

The Floor planning Problem

  • If tile sizes vary widely (custom topologies), floor planning becomes a necessary step in order to optimize the total area

In contrast to normal floor planning problems, you have to take into account:

  • the placement of the routers and repeaters for predictable latency
  • possible electronic coupling effects in the communication architecture

slide-33
SLIDE 33

The Mapping Problem

[Figure: an Application Characterization Graph (APCG) with the cores CPU1, CPU2, MEM, DSP, and I/O, next to a topology of switches (s) with empty tiles]

Determine a mapping function that maps the IP cores onto the topology.
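
One way to read "mapping function" is as a search over assignments of cores to tiles. The sketch below is not the method of the lecture: it assumes a 2 x 2 mesh, a made-up APCG with bandwidth demands, and a cost of bandwidth times hop count, and it simply tries every assignment. Exhaustive search only works for very small systems; all names and numbers are hypothetical.

```python
from itertools import permutations

# Hypothetical APCG: (source, destination) -> bandwidth demand in arbitrary units.
APCG = {
    ("CPU1", "MEM"): 100,
    ("CPU2", "MEM"): 80,
    ("DSP", "MEM"): 40,
    ("CPU1", "DSP"): 10,
}
CORES = ["CPU1", "CPU2", "MEM", "DSP"]
TILES = [(x, y) for y in range(2) for x in range(2)]  # assumed 2 x 2 mesh


def hops(a, b):
    """Manhattan distance between two tiles, i.e. hop count on a mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def cost(mapping):
    """Total communication cost: bandwidth times hop count, summed over all edges."""
    return sum(bw * hops(mapping[s], mapping[d]) for (s, d), bw in APCG.items())


def best_mapping():
    """Try every core-to-tile assignment and keep the cheapest one."""
    best = None
    for perm in permutations(TILES):
        mapping = dict(zip(CORES, perm))
        c = cost(mapping)
        if best is None or c < best[0]:
            best = (c, mapping)
    return best


if __name__ == "__main__":
    print(best_mapping())
```

In this example the search places the core with the lowest demand towards MEM (the DSP) on the tile diagonal to MEM, so the heavy flows travel a single hop.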

slide-34
SLIDE 34

The Scheduling Problem

[Figure: tasks t1 to t6 scheduled onto the processing elements PE1 to PE4]

  • For static scheduling there exist usable algorithms, but applications with conditional branches are a problem

slide-35
SLIDE 35

The Switching and Routing Problem

Which switching strategy should be used?

  • Store-and-forward
  • Cut-through
  • Wormhole

Which routing strategy should be used?

  • Adaptive
  • Deterministic
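
To see why the choice of switching strategy matters, a zero-load latency comparison helps. The sketch below uses a textbook-style model that is an assumption here, not taken from the slides: no contention, every flit takes one cycle per hop, and a packet consists of a fixed number of flits. Under these assumptions cut-through and wormhole have the same zero-load latency; they differ in how much buffering a blocked packet needs.

```python
def store_and_forward_cycles(hops: int, flits: int) -> int:
    """Each router receives the whole packet before it forwards the first flit."""
    return hops * flits


def wormhole_cycles(hops: int, flits: int) -> int:
    """The head flit needs `hops` cycles; the remaining flits follow in a pipeline."""
    return hops + flits - 1


if __name__ == "__main__":
    for hops in (1, 4, 8):
        print(hops, store_and_forward_cycles(hops, 8), wormhole_cycles(hops, 8))
```
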
slide-36
SLIDE 36

QNoC Architecture

• Wormhole routing
  • For reduced buffering

[Figure: a packet (command, address, payload) sent from source S to destination D as a wormhole packet, i.e. a sequence of flits; the first flit carries the routing info]

Ref: QNoC group
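
The packet-to-flit split can be sketched in a few lines. The format below is hypothetical and only meant to illustrate the idea that the first flit carries the routing information while command, address, and payload follow: the field sizes, the flit width, and the names `Flit` and `packetize` are assumptions, not the QNoC format.

```python
from dataclasses import dataclass


@dataclass
class Flit:
    kind: str    # "head", "body", or "tail"
    data: bytes


def packetize(dest, command, address, payload, flit_bytes=4):
    """Split one packet into a head flit (routing info) plus body/tail flits."""
    head = Flit("head", bytes(dest))  # routing info: destination (x, y)
    # Command (1 byte) and address (4 bytes) precede the payload, so there
    # is always at least one flit after the head.
    body = bytes([command]) + address.to_bytes(4, "big") + payload
    chunks = [body[i:i + flit_bytes] for i in range(0, len(body), flit_bytes)]
    flits = [head] + [Flit("body", chunk) for chunk in chunks]
    flits[-1].kind = "tail"  # the last flit releases the path
    return flits


if __name__ == "__main__":
    for flit in packetize((2, 3), command=1, address=0x1000, payload=b"abcdefg"):
        print(flit)
```

Only the head flit is inspected for routing; the other flits follow the path it has opened, which is why routers need only a few flit buffers.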

slide-37
SLIDE 37

Wormhole Switching

• Suits on-chip interconnect
• Small number of buffers
• Low latency
• Virtual channels
  • interleaving packets on the same link

[Figure: IP1, IP2, and IP3 attached to the network through their interfaces]

slide-38
SLIDE 38

Wormhole Delay Analysis

• The delivery resembles a pipeline pass
• Packet transmission can be divided into two separate phases:
  • Path acquisition
  • Packet delivery
• We focus on the packet delivery phase

[Figure: IP1 and IP2 connected through their network interfaces]

slide-39
SLIDE 39

Packet Delivery Time

• Packet delivery time is dominated by the slowest link
  • Transmission rate
  • Link sharing

[Figure: path from IP1 to IP2 through their interfaces, with one low-capacity link]
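
The pipeline view of packet delivery can be turned into a rough estimate. This is a minimal sketch under simplifying assumptions that are not from the slides: the path is already acquired, link rates are given in flits per cycle, a shared link divides its capacity equally among the packets using it, and pipeline fill time is ignored. The function names are hypothetical.

```python
def effective_rate(capacity: float, sharers: int) -> float:
    """Rate one packet sees on a link that `sharers` packets use at the same time."""
    return capacity / sharers


def delivery_cycles(flits: int, link_rates: list) -> float:
    """Delivery-phase time of a packet: the slowest link paces the whole pipeline."""
    bottleneck = min(link_rates)  # flits per cycle
    return flits / bottleneck


if __name__ == "__main__":
    # Three links on the path; the middle one is shared by four packets.
    rates = [1.0, effective_rate(1.0, 4), 1.0]
    print(delivery_cycles(16, rates))
```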

slide-40
SLIDE 40

Adaptive Algorithm

• Minimal-path adaptive routing
• Header contains a packet direction vector indicating the destination direction (Virtual Channel ID, VCID)

[Figure: mesh with source S and destination D; the packet may leave towards N or E]

slide-41
SLIDE 41

Adaptive Algorithm

• Routing performed in two steps
• Partition output paths into 2 choices (next-hop quadrants)
• In previous node:
  • Look-ahead routing (route for current node) determines the output quadrant
• In current node:
  • Select output port
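
The two steps can be sketched for a 2D mesh. The code below is an illustration under assumptions, not the router of the lecture: coordinates grow to the east (x) and north (y), the quadrant is represented as the list of ports that bring the packet closer to its destination, and the port is chosen by a hypothetical congestion value per output (for example buffer occupancy). In the slides the first step runs in the previous node; here both steps are plain functions.

```python
def productive_ports(cur, dst):
    """Step 1: output ports on a minimal path from cur to dst (at most two)."""
    (cx, cy), (dx, dy) = cur, dst
    ports = []
    if dx > cx:
        ports.append("E")
    elif dx < cx:
        ports.append("W")
    if dy > cy:
        ports.append("N")
    elif dy < cy:
        ports.append("S")
    return ports or ["LOCAL"]  # already at the destination


def select_port(cur, dst, congestion):
    """Step 2: among the productive ports, take the least congested one."""
    candidates = productive_ports(cur, dst)
    return min(candidates, key=lambda port: congestion.get(port, 0))


if __name__ == "__main__":
    print(productive_ports((0, 0), (2, 3)))
    print(select_port((0, 0), (2, 3), {"E": 5, "N": 1}))
```

Because only productive ports are considered, every hop shortens the remaining distance, so the route stays minimal whichever port is picked.
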
slide-42
SLIDE 42

Crossbar Switch

Current Implementation

[Figure: 4 x 4 mesh of tiles labeled (0,0) to (3,3), each with a router and a processing element. Each router has five input buffers (North, East, South, West, Local), each with an address decoder and a channel controller, a crossbar with arbiter, and five output channels (North, East, South, West, Local).]

slide-43
SLIDE 43

Overview

• Motivation
• NoC design challenges
• State of the art: Xpipes NoC architecture
• Quality-of-Service (QoS) architectures

slide-44
SLIDE 44

Xpipes architecture

• The OCP protocol is used for communication with the cores.
• Packet partitioning: a flit type field indicates head/tail and header/payload flits.

slide-45
SLIDE 45

Network link

slide-46
SLIDE 46

Switch – output buffer

slide-47
SLIDE 47

Network interface

• Translates OCP into network packets
• NIS: network interface slave

slide-48
SLIDE 48

Network interface – output buffer

slide-49
SLIDE 49

Xpipes Compiler

• Heterogeneous network
• Automated tool
• Uses SystemC macros

slide-50
SLIDE 50

Overview

• Motivation
• NoC design challenges
• State of the art: Xpipes NoC architecture
• Quality-of-Service (QoS) architectures

slide-51
SLIDE 51

Quality of Service

Service Class Specification

Guarantees / probabilistic on:

  • Performance-related guarantee
  • Reliability-related guarantee

Performance:

  • Max end-to-end latency
  • Min throughput (% bandwidth)
  • Max deviation of latency (jitter)

Reliability:

  • In-order data transmission
  • Correctness of data
  • No loss of data (lossless transmission)
  • Availability

slide-52
SLIDE 52

Related Works

• Connection-based approach, e.g. Æthereal

  • K. Goossens et al. Æthereal Network on Chip: Concepts, Architectures, and Implementations, 2005.
  • E. Rijpkema et al. Trade-offs in the design of a router with both guaranteed and best-effort services for networks on chip, 2003.

• Service-class-based approach, e.g. DiffServ, QNoC

  • Evgeny Bolotin et al. QNoC: QoS architecture and design process for network on chip, 2004.
  • M. D. Harmanci et al. Quantitative modeling and comparison of communication schemes to guarantee quality-of-service in networks-on-chip, 2005.
  • N. Kavaldjiev et al. A virtual channel Network-on-Chip for GT and BE traffic, 2006.

slide-53
SLIDE 53

Related Works

Connection based:

  • Connection management
  • Fixed resource reservation
  • Advantages: TDMA / guaranteed throughput, contention-free routing, buffer reduction
  • Disadvantage: underutilization

Service class based:

  • Classification at design time
  • Lookup tables
  • Advantages: high resource utilization, priority-aware service
  • Disadvantages: no connections (relative guarantees), contention / starvation, buffer requirements, inflexibility of classification; no hard guarantees

slide-54
SLIDE 54

The Æthereal NoC

• Philips Research Laboratories
• Connection-oriented architecture
• The Æthereal NoC provides:
  • Best-effort service
  • Guaranteed service
• TDMA (Time Division Multiple Access)

[Figure: producer and consumer pairs exchanging requests and responses between A and B]

slide-55
SLIDE 55

TDMA—Time Division Multiple Access

• What is TDMA? Time is divided into slots, and different sources share a resource

[Figure: packets a to f passing routers R1 and R2 in time slots s1 to s3 (T1, T2)]

• How is TDMA adopted in the Æthereal NoC?

slide-56
SLIDE 56

TDMA-Contention free routing

• Slot table
  • The slots have a fixed, equal duration
  • Maps outputs to inputs for every slot

[Figure: routers with input ports i0 to i3 and output ports o1 to o3 and their slot tables; packets a, b, and c are forwarded in consecutive slots (T1, T2, T3) without contention]
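
A slot table that maps outputs to inputs for every slot can be sketched as follows. This is a simplified model under stated assumptions, not the Æthereal implementation: every router has the same number of slots, a flit that leaves one router in slot s leaves the next router in slot s + 1 (modulo the table size), and a reservation fails as soon as one output is already taken in the required slot. The class and function names are hypothetical.

```python
class SlotTable:
    """Per-router table: for every slot, which input feeds which output."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.table = [dict() for _ in range(num_slots)]  # slot -> {output: input}

    def is_free(self, slot, output):
        return output not in self.table[slot % self.num_slots]

    def reserve(self, slot, output, inp):
        self.table[slot % self.num_slots][output] = inp


def reserve_path(tables, path, first_slot):
    """Reserve consecutive slots along `path`, a list of (router, input, output).

    Returns False and changes nothing if any required slot is already taken.
    """
    for k, (router, _inp, out) in enumerate(path):
        if not tables[router].is_free(first_slot + k, out):
            return False
    for k, (router, inp, out) in enumerate(path):
        tables[router].reserve(first_slot + k, out, inp)
    return True


if __name__ == "__main__":
    tables = {"R1": SlotTable(3), "R2": SlotTable(3)}
    print(reserve_path(tables, [("R1", "i0", "o1"), ("R2", "i3", "o2")], 0))
    print(tables["R2"].table)
```

Since every output is claimed by at most one input per slot, two reserved connections can never compete for the same link in the same slot, which is what makes the routing contention free.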

slide-57
SLIDE 57

Distributed Programming Model

Three packets:

  • SetUp
  • TearDown
  • AckSetUp

• Like asynchronous transfer mode (ATM)

[Figure: SetUp, TearDown, and AckSetUp packets travelling between IP1 and IP2 through the routers a, b, c, and d]

slide-58
SLIDE 58

Centralized programming model

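• Motivation: smaller chips, infrequent mode changes
• Central configuration processor
• Uses abstract guaranteed-service (GS) or best-effort (BE) ReserveSlot and FreeSlot packets

[Figure: a configuration processor programming the routers A and B (ports p) between IP1 and IP2]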

slide-59
SLIDE 59

QNoC: QoS NoC

Define Service Levels (SLs):

  • Signaling
  • Real-Time
  • Read/Write (RD/WR)
  • Block-Transfer

• Different QoS for each SL
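
One simple way to give each service level a different QoS is strict priority arbitration between per-level queues. The sketch below assumes that the levels are served in the order in which they are listed above, with Signaling first; the slide itself only states that the QoS differs per level, so the ordering, the queue structure, and the function name are assumptions.

```python
from collections import deque

# Assumed priority order, highest first (the order of the list above).
SERVICE_LEVELS = ["Signaling", "Real-Time", "RD/WR", "Block-Transfer"]


def next_flit(queues):
    """Pick the next flit to send: the non-empty queue with the highest priority wins."""
    for level in SERVICE_LEVELS:
        queue = queues.get(level)
        if queue:
            return level, queue.popleft()
    return None  # nothing to send


if __name__ == "__main__":
    queues = {
        "RD/WR": deque(["w0"]),
        "Signaling": deque(["s0"]),
        "Block-Transfer": deque(["b0", "b1"]),
    }
    while (item := next_flit(queues)) is not None:
        print(item)
```

Strict priority keeps the latency of the top level low, at the price that lower levels can starve under sustained high-priority load.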

slide-60
SLIDE 60

QNoC Architecture

• Mesh topology
• Fixed shortest-path routing (X-Y)
  • Simple router (no tables, simple logic)
  • Power-efficient communication
  • No deadlock scenario
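
X-Y routing on a mesh needs no tables: a packet first travels along the X dimension until its column matches the destination, then along Y. The sketch below illustrates this; the coordinate convention and the function name are assumptions made for the example.

```python
def xy_route(src, dst):
    """Return the list of tiles visited from src to dst: X dimension first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

The path is always a shortest one, and because no packet ever turns from the Y dimension back into X, the cyclic channel dependencies that cause deadlock cannot form.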

slide-61
SLIDE 61

QNoC Wormhole Router

[Figure: QNoC wormhole router attached to modules or to another router. Input ports and output ports are connected by a crossbar; the ports have buffers per service level (SIGNAL, RT, RD/WR, BLOCK), CREDIT signals, and scheduler, control, and routing blocks.]

Ref: QNoC group

slide-62
SLIDE 62

Questions?
