SoC Design Lecture 10: On-Chip Interconnection Networks

SLIDE 1

SoC Design
Lecture 10: On-Chip Interconnection Networks

Shaahin Hessabi
Department of Computer Engineering
Sharif University of Technology

SLIDE 2

Signal Transmission on SoC

We focus on global wires.

  • Local wires can scale with technology, and present design styles may still apply.
  • Global wires are on the top-level metals (with larger pitch and width).
      • Increased pitch reduces cross-coupling (improving noise immunity).
      • Increased width reduces wire resistance.
      • Increased spacing around the wire prevents capacitance growth.
      • Inductive effects grow relative to resistance and capacitance.
  • Future global wires will be modeled as lossy transmission lines, as opposed to RC models.
      • Loss causes signal attenuation and frequency dispersion of fast signals.
      • Both can be reduced by splitting wires into several sections with buffers in between.
      • Impedance matching is required due to line inductance.

Hessabi@Sharif University of Technology SoC: On-Chip Interconnection Networks

SLIDE 3

Signal Integrity

Signal integrity: error-free information transfer (at the physical level) on global wires will become harder, because:

  • Signal swings are reduced, with a corresponding reduction in voltage noise margins.
  • Crosstalk increases.
  • There is more EMI because of smaller voltage swings and smaller dynamic storage capacitances.
  • There are more synchronization failures and/or metastability, because of transmission speed changes, local clock frequency changes, timing noise (jitter), and so on.

Soft errors will be a potential hazard for large SoCs as well.

SLIDE 4

On-Chip Interconnection Networks

  • Shared-Medium Networks
  • Switched-Media Networks (Direct and Indirect Networks)
  • Hybrid Networks

SLIDE 5

SLIDE 6

Shared-Medium Networks

Simplest interconnect structures.

  • The transmission medium is shared by all communication devices.
  • The network is usually passive: it does not generate control or data messages.
  • Serialization: only one component can send a message at any given time.
  • Order of messages.

Interconnection structures:

  • Point-to-point
  • On-chip bus
  • On-chip network

SLIDE 7

Types of Busses

  • Processor-memory bus (design specific).
      • Short and high speed.
      • Only needs to match the memory system.
      • Maximizes memory-to-processor bandwidth.
      • Connects directly to the processor.
      • Optimized for cache block transfers.
  • I/O bus (industry standard).
      • Usually lengthy and slower.
      • Needs to match a wide range of I/O devices.
      • Connects to the processor-memory bus or backplane bus.
  • Backplane bus (standard or proprietary).
      • Backplane: an interconnection structure within the chassis.
      • Allows processors, memory, and I/O devices to coexist.
      • Cost advantage: one bus for all components.

SLIDE 8

Traditional Bus vs. OCB

Traditional Bus (Off-Chip Bus) vs. OCB (On-Chip Bus), points of comparison:

  • Shared I/O
  • Fixed interconnection scheme
  • Routing resource in target device (e.g., FPGA, ASIC)
  • Fixed timing requirement
  • Dedicated address decoding
  • Bandwidth and latency are important

SLIDE 9

Off-Chip Bus

Connection of discrete chips on a PCB.

  • PCI, ISA, … are off-chip busses.

Design criteria:

  • High-speed communication between discrete devices (about 30 MHz to 100 MHz).
  • Minimizing the number of bus signals, i.e., pins, to reduce the cost of the PCB.
  • Tri-state signaling for add-in cards and extensions, to disconnect the non-active cards.
  • PCI uses multiplexed signals for address and data.

SLIDE 10

On-Chip Busses

  • No use of tri-state signals: a tri-state bus is difficult for static timing analysis, as the bus loading is only identified through dynamic simulation.
  • High-performance transaction schemes:
      • Point-to-point protocol
      • Split transaction
  • Efficient arbitration schemes are adopted.

SLIDE 11

Shared I/O (Multiplexer Bus vs. Tri-State Bus)

  • Three-state I/O is slower than direct interconnection.
      • Solution in OCB: mux interconnection.
      • Xilinx design guidelines: recommended, because of technology independence and better portability.
  • Multiplexed functional I/O (e.g., address/data) needs more time to transfer data.
      • Solution in OCB: multiple busses.

Three-State Bus

  • Only one bus master can output address or data (otherwise there is a collision).
  • A bus grant is needed to output address or data.

Multiplexer Bus

  • Bus masters can send their requests, including address and data (for a write), at the same time.
  • The arbiter selects a bus master.

SLIDE 12

Physical Constraints

Fixed interconnection scheme:

  • Traditional busses are usually routed across a standard backplane.
  • OCB allows a variable interconnection scheme, defined by the system integrator (tool level).

Fixed timing requirement:

  • Traditional busses have fixed timing requirements:
      • Highly capacitive and inductive loads.
      • Designed for the worst-case operating conditions, when unknown bus modules are connected together.
  • OCB has a variable timing specification that:
      • Can be enforced by place & route tools (tool level).
      • Usually does not specify absolute timing.

SLIDE 13

Bus Components

  • Switch or node
      • Arbitration, routing
  • Converter or bridge (type converter)
      • From one protocol to another
  • Size converter
      • Buffering capacity

SLIDE 14

Bussing Strategies

Register-to-register communications:

  • Point-to-point.
  • Single shared bus.
  • Multiple special-purpose busses.

Tradeoffs between datapath/control complexity and the amount of parallelism supported by the hardware.

Master vs. Slave

A bus transaction includes two parts:

  • Master: issuing the command (and address), i.e., the request.
  • Slave: transferring the data, i.e., the action.

SLIDE 15

A Computer System with One Bus: Backplane Bus

  • Backplane bus: the most common on-chip shared-medium architecture.
  • A single bus is used for:
      • Processor-to-memory communication.
      • Communication between I/O devices and memory.
  • Low-overhead interconnection for a small number of active processors (i.e., bus masters) and a large number of passive modules (i.e., bus slaves) that only respond to requests from bus masters.
  • Disadvantages: slow; the bus can become a major bottleneck.
  • Example: IBM PC.

SLIDE 16

A Two-Bus System

  • I/O busses tap into the processor-memory bus via bus adaptors:
      • Processor-memory bus: mainly for processor-memory traffic.
      • I/O busses: provide expansion slots for I/O devices.
  • Apple Macintosh-II:
      • NuBus: processor, memory, and a few selected I/O devices.
      • SCSI bus: the rest of the I/O devices.

SLIDE 17

A Three-Bus System

  • A small number of backplane busses tap into the processor-memory bus:
      • The processor-memory bus is only used for processor-memory traffic.
      • I/O busses are connected to the backplane bus.
  • Advantage: loading on the processor bus is greatly reduced.

SLIDE 18

Bus Advantages

  • Versatility: any bus is almost directly compatible with most available IPs.
      • New devices can be added easily.
      • Peripherals can be moved between computer systems that use the same bus standard.
  • Low cost: the silicon cost of a bus is near zero.
  • Bus latency is zero once the arbiter has granted control.
  • Concepts are simple and well understood.

SLIDE 19

Bus Disadvantages

  • Scalability
  • Contention
  • Power issues
  • Creates a communication bottleneck.
      • The bandwidth of the bus can limit the maximum I/O throughput.
  • The maximum bus speed is largely limited by:
      • The length of the bus.
      • The number of devices on the bus (bus loading): every unit attached adds parasitic capacitance.
      • The need to support a range of devices with widely varying latencies and widely varying data transfer rates.
  • Bus arbiter delay grows with the number of masters.

SLIDE 20

What Defines a Bus?

SLIDE 21

Bus Protocols

Protocols determine:

  • The transactions that are supported.
  • The timing of their cycles.
  • How modules are addressed.
  • Allocation of resources.

Without a special bus protocol, the bus is not efficiently used.

SLIDE 22

Bus Pipelining

  • A memory access consists of several cycles (including arbitration).
  • The bus is not used in all of these cycles, so pipelining is used to increase performance.
  • Only one transaction can:
      • Receive the grant during a given cycle.
      • Use the bus during a given cycle.
  • Pipelining leads to an efficient use of the bus.
  • Stalls are inserted, since only one transaction can use the bus at a time.

SLIDE 23

Bus Properties

  • Support for broadcast of information.
      • Highly advantageous when communication is highly asymmetric.
  • Every device connected to the network has a network interface: requester, driver, and receiver circuits.
  • To access the bus:
      • Bus transmit: ET active.
      • Bus receive: ER active.

SLIDE 24

Cycles, Messages and Transactions

  • Message: a logical unit of information.
      • A read message contains an address and control signals for the read.
      • Three classes of information units on a bus: data, address, and control.
      • These are either time-multiplexed on the bus or travel over dedicated busses/wires.
      • Tradeoff between hardware cost (area) and performance.
  • Cycles: a message requires a number of cycles to be sent from sender to receiver over the bus.
  • Transaction: a sequence of messages that together complete one operation.
      • A memory read requires a memory read message and a reply with the requested data.

SLIDE 25

Bus Options

SLIDE 26

Increasing the Bus Bandwidth

Separate versus multiplexed address and data lines:

  • Address and data can be transmitted in one cycle with separate address and data lines.
  • Cost: (a) more bus lines, (b) increased complexity.

Data bus width:

  • By increasing the bus width, transfers of multiple words require fewer bus cycles.
  • Example: the SPARCstation 20's memory bus is 128 bits wide.
  • Cost: more bus lines.

Block transfers:

  • Allow the bus to transfer multiple words in back-to-back bus cycles.
  • Only one address needs to be sent at the beginning.
  • The bus is not released until the last word is transferred.
  • Cost: (a) increased complexity, (b) longer response time for other requests, since the bus is held for the whole block.
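The three options above can be compared with a rough cycle count. This is only a sketch under an idealized cost model that is not from the lecture: every address phase and every data phase costs exactly one bus cycle, and arbitration and wait states are ignored. The function names are hypothetical.

```python
import math

def cycles_multiplexed(words: int) -> int:
    """Multiplexed address/data lines: each word needs an address cycle, then a data cycle."""
    return 2 * words

def cycles_separate(words: int) -> int:
    """Separate address and data lines: address and data travel in the same cycle."""
    return words

def cycles_block(words: int, bus_width_words: int = 1) -> int:
    """Block transfer on multiplexed lines: one address cycle, then back-to-back data cycles.

    bus_width_words is how many words the data bus carries per cycle (assumed parameter).
    """
    return 1 + math.ceil(words / bus_width_words)
```

Under these assumptions, moving 8 words costs 16 cycles on a multiplexed bus, 8 cycles with separate lines, and 9 cycles as a single block transfer (3 cycles if the bus is four words wide).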

SLIDE 27

Increasing Transaction Rate on Multi-master Bus

  • Overlapped arbitration.
      • Perform arbitration for the next transaction during the current transaction.
  • Bus parking.
      • A master holds onto the bus and performs multiple transactions as long as no other master makes a request.
  • Overlapped address / data phases.
  • Split-phase (or packet-switched) bus.
      • Completely separate address and data phases.
      • Arbitrate separately for each.
      • The address phase yields a tag, which is matched with the data phase.

All of the above are used in most modern memory busses.

SLIDE 28

Synchronous vs. Asynchronous Operation

  • Synchronous bus: includes a common clock in the control lines.
      • Examples: AMBA bus, PC busses (PCI, ISA, etc.), Sun P/M, SCSI, …
      • A fixed protocol for communication that is relative to the clock.
      • Advantage: involves very little logic and can run very fast.
      • Disadvantages:
          • Every device on the bus must run at the same clock rate.
          • To avoid clock skew, busses cannot be long if they are fast.
  • Asynchronous bus: not clocked, but requires a handshaking protocol.
      • Examples: MicroChannel (IBM), SCSI 2, VME, MARBLE (AMULET), …
      • Custom-designed busses.
      • Can accommodate a wide range of devices.
      • Can be lengthened without worrying about clock skew.
  • Current commercial on-chip busses are synchronous.
      • The bus clock is slower than the clock of fast masters.
      • Simplicity and ease of testing/debugging are prioritized over performance.

SLIDE 29

Routing, Arbitration, Switching

Routing

  • Which of the possible paths are allowable (valid) for packets?
  • Provides the set of operations needed to compute a valid path.
  • Executed at source, intermediate, or even at destination nodes.

Arbitration

  • When are paths available for packets? (along with flow control)
  • Resolves packets requesting the same resources at the same time.
  • For every arbitration, there is a winner and possibly many losers.
  • Losers are buffered (lossless) or dropped on overflow (lossy).
  • Lossy networks: packets are dropped (discarded) at the receiver when buffers fill up. The sender is notified to retransmit packets (via time-out or NACK).

Switching

  • How are paths allocated to packets?
  • The winning packet (from arbitration) proceeds towards the destination.
  • Paths can be established one fragment at a time or in their entirety.

SLIDE 30

Arbitration: Obtaining Access to the Bus

  • One of the most important issues in bus design: how is the bus reserved by a device that wishes to use it?
  • Chaos is avoided by a master-slave arrangement:
      • Only the bus master can control access to the bus: it initiates and controls all bus requests.
      • A slave responds to read and write requests.
  • The simplest system:
      • The processor is the only bus master.
      • All bus requests must be controlled by the processor.
      • Major drawback: the processor is involved in every transaction.

SLIDE 31

Multiple Bus Masters: Need for Arbitration

  • Bus arbitration scheme:
      • A bus master wanting to use the bus asserts the bus request.
      • A bus master cannot use the bus until its request is granted.
      • A bus master must signal to the arbiter the end of the bus utilization.
  • Starvation:
      • Arises when packets can never gain access to requested resources.
      • Solution: grant resources to packets with fairness, even if prioritized.
  • Bus arbitration schemes usually try to balance two factors:
      • Bus priority: the highest-priority device should be serviced first.
      • Fairness: even the lowest-priority device should never be completely locked out from the bus.
  • Bus arbitration schemes can be divided into four broad classes:
      • Daisy chain arbitration.
      • Centralized, parallel arbitration.
      • Distributed arbitration by self-selection: each device wanting the bus places a code indicating its identity on the bus.
      • Distributed arbitration by collision detection: each device just “goes for it”, and problems are found after the fact (Ethernet).

SLIDE 32

Bus Arbitration

SLIDE 33

The Daisy Chain Bus Arbitration Scheme

  • Advantage: simple.
  • Disadvantages:
      • Cannot assure fairness: a low-priority device may be locked out indefinitely.
      • The use of the daisy-chain grant signal also limits the bus speed.

SLIDE 34

Centralized Parallel Arbitration

Used in essentially all processor-memory busses and in high-speed I/O busses.

SLIDE 35

Arbitration Mechanisms

  • Currently performed centrally by a bus arbiter module.
  • A processor must first gain bus mastership from the arbiter.
      • This implies a control transaction and a loss of communication performance.
      • Arbitration should be as fast, and as rare, as possible.
  • Also, the response time of slow bus slaves may cause serious performance losses, because the bus remains idle while the master waits for the slave to respond.
  • Arbiters are used not only in bus systems, but wherever several devices request shared resources.
      • In NoCs, arbitration is needed if two or more packets want to enter the same channel.

SLIDE 36

Arbiter Interfaces

  • An arbiter interface can be used to give a bus grant for:
      • a fixed number of cycles;
      • variable lengths: the grant is held as long as the “hold” line (controlled by the client) is asserted.
  • Fairness is a key property of an arbiter:
      • Weak fairness: every request is eventually served.
      • Strong fairness: requests are served equally often.
      • Weighted strong fairness: the number of times requester i is served is proportional to its weight w_i.
      • FIFO fairness: requests are served in the order in which they were made.
  • Local vs. global fairness: a system built from several fair arbiters may not be fair as a whole.

SLIDE 37

Fixed-Priority Arbiter

  • A fixed-priority arbiter can be constructed as an iterative circuit.
  • Each cell receives a request input r_i and a carry input c_i, and generates a grant output g_i and a carry output c_{i+1}.
  • The resulting arbiter is not fair, since a continuously asserted request r_0 means that none of the other requests will ever be served!
  • A fair arbiter can be generated by changing the priority from cycle to cycle.
      • Only one input p_i has the value 1; all other inputs p_j have the value 0.
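The iterative structure can be sketched in software. The slide does not give the cell equations, so the ones used here (g_i = r_i AND c_i, c_{i+1} = c_i AND NOT r_i, with c_0 = 1) are an assumption based on the usual textbook form of this circuit; the function name is hypothetical.

```python
def fixed_priority_arbiter(requests):
    """Return one grant bit per request; index 0 has the highest priority.

    Assumed cell logic: g_i = r_i AND c_i, c_{i+1} = c_i AND NOT r_i, c_0 = 1.
    """
    grants = []
    carry = True  # c_0: no higher-priority request has been seen yet
    for r in requests:
        grants.append(r and carry)  # grant only if nobody above is requesting
        carry = carry and not r     # a request blocks every cell below it
    return grants
```

With all three requests asserted, only index 0 is granted, which is the unfairness the slide describes: as long as r_0 stays asserted, no other requester is ever served.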

SLIDE 38

Oblivious Arbiters

  • If p_i is generated without knowledge of r_i and g_i, the result is an oblivious arbiter.
  • Examples are:
      • Randomly generated p_i.
      • Rotating priorities (by shift register).
  • Oblivious arbiters provide weak fairness, but not strong fairness.
      • Example: if r_0 and r_1 are constantly asserted, request r_1 wins the arbitration only when p_1 is true; in all other cases r_0 gets the grant.
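The weak-but-not-strong fairness of a rotating-priority arbiter can be checked with a small simulation. This is a sketch with hypothetical names; it assumes the priority pointer advances by one position every cycle regardless of requests and grants, and that the search wraps around from the pointer position.

```python
def priority_arbiter(requests, p):
    """Grant the first asserted request at or after position p, wrapping around."""
    n = len(requests)
    for k in range(n):
        i = (p + k) % n
        if requests[i]:
            return i
    return None

def count_grants(requests, cycles):
    """Rotate the priority pointer obliviously (like a shift register) and tally grants."""
    n = len(requests)
    counts = [0] * n
    for t in range(cycles):
        winner = priority_arbiter(requests, t % n)  # pointer ignores r_i and g_i
        if winner is not None:
            counts[winner] += 1
    return counts
```

With four inputs and only r_0 and r_1 constantly asserted, r_1 wins only in the cycles where the pointer sits on position 1, so r_0 collects three grants for every one that r_1 gets. Every request is eventually served, but not equally often.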

SLIDE 39

Round-Robin Arbiter

  • A round-robin arbiter achieves strong fairness.
      • A request that was just served gets the lowest priority.
  • A weighted round-robin arbiter gives some requesters a larger number of grants than others, in a controlled fashion.
      • If three devices have the weights 1, 2, and 3, they get 1/6, 1/3, and 1/2 of the grants.
      • The preset line is activated periodically after N (here 6) cycles to load the counter with its weight.
      • If some modules do not issue any requests during that interval, the shared resource will remain idle until the next preset cycle.
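The basic (unweighted) round-robin rule, where the requester that was just served drops to the lowest priority, can be sketched as follows. The class and attribute names are hypothetical, and the weighted variant with preset counters is not modeled.

```python
class RoundRobinArbiter:
    """Round-robin arbiter: the most recently served requester gets the lowest priority."""

    def __init__(self, n):
        self.n = n
        self.next_top = 0  # index that currently holds the highest priority

    def arbitrate(self, requests):
        """Return the granted index, or None if nothing is requested."""
        for k in range(self.n):
            i = (self.next_top + k) % self.n
            if requests[i]:
                # The winner becomes lowest priority: its successor is now on top.
                self.next_top = (i + 1) % self.n
                return i
        return None
```

In contrast to the oblivious rotating scheme, two constantly asserted requests now simply alternate, so each receives half of the grants.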

SLIDE 40

Matrix Arbiter

  • A matrix arbiter implements a least-recently-served priority scheme by maintaining a triangular array of state bits w_ij for all i < j.
  • Fast, easy to implement, and provides strong fairness.
  • Hence, it is very well suited for a small number of inputs.
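A behavioral sketch of the matrix arbiter follows. It assumes that w_ij = 1 means requester i currently beats requester j, that only the upper triangle is stored (w_ji is taken as the complement of w_ij), and that a winner is moved to the bottom of the priority order. These conventions and all names are assumptions for illustration; the slide only states that a triangular array of bits is kept.

```python
class MatrixArbiter:
    """Least-recently-served arbiter backed by a triangular bit array."""

    def __init__(self, n):
        self.n = n
        # w[(i, j)] for i < j is True when requester i beats requester j.
        self.w = {(i, j): True for i in range(n) for j in range(i + 1, n)}

    def _beats(self, i, j):
        # The lower triangle is implied: w_ji is the complement of w_ij.
        return self.w[(i, j)] if i < j else not self.w[(j, i)]

    def _make_lowest(self, i):
        # After being served, requester i loses to every other requester.
        for j in range(self.n):
            if j < i:
                self.w[(j, i)] = True
            elif j > i:
                self.w[(i, j)] = False

    def arbitrate(self, requests):
        """Grant the requester that no other active requester beats."""
        for i in range(self.n):
            if requests[i] and not any(
                requests[j] and self._beats(j, i)
                for j in range(self.n) if j != i
            ):
                self._make_lowest(i)
                return i
        return None
```

The state needs n(n-1)/2 bits, which grows quadratically with the number of inputs; that is consistent with the slide's remark that the scheme suits a small number of inputs.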

SLIDE 41

Queuing Arbiter

  • A queuing arbiter provides FIFO fairness.
  • It assigns each request a time stamp when it is asserted.
  • The request with the earliest time stamp receives the grant.
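The time-stamp rule can be sketched with a priority queue. This is a minimal illustration with hypothetical names; it assumes a simple counter as the time source and one grant per call.

```python
import heapq

class QueuingArbiter:
    """FIFO-fair arbiter: the request with the earliest time stamp is granted first."""

    def __init__(self):
        self._clock = 0       # assumed time source: one tick per asserted request
        self._pending = []    # min-heap of (time_stamp, requester)

    def request(self, requester):
        """Stamp the request at the moment it is asserted."""
        heapq.heappush(self._pending, (self._clock, requester))
        self._clock += 1

    def grant(self):
        """Grant the pending request with the earliest time stamp, if any."""
        if not self._pending:
            return None
        _, requester = heapq.heappop(self._pending)
        return requester
```

Requests made in the order "dma", "cpu", "usb" are granted in exactly that order, whatever fixed priority the devices might otherwise have.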

SLIDE 42

Split-Transaction Bus

  • In a split-transaction bus, a transaction is split into two transactions:
      • a “request” transaction;
      • a “reply” transaction.
  • Both transactions have to compete for the bus by arbitration.
  • The advantages of the split-transaction bus are evident if there is a variable delay for requests.

SLIDE 43

Terms and Definitions

  • Bandwidth (BW): maximum rate (bps or Bps) at which information can be transferred (including packet header, payload, and trailer).
      • Aggregate BW: total data bandwidth supplied by the network.
      • Effective BW (throughput): fraction of the aggregate bandwidth delivered to the application.
  • Time of flight: time for the first bit of a packet to arrive at the receiver.
      • Includes the time for a packet to pass through the network, not including the transmission time.
  • Transmission time: the time for a packet to pass through the network, not including the time of flight.
      • Equal to the packet size divided by the data bandwidth of the link.
  • Transport latency: time of flight + transmission time.
      • Measures the time that a packet spends in the network.
  • Sending overhead (latency): time to prepare a packet for injection, including hardware/software.
      • A constant term (packet size) plus a variable term (buffer copies).

SLIDE 44

Terms and Definitions (cont’d)

  • Receiving overhead (latency): time to process an incoming packet at the end node.
      • A constant term plus a variable term.
      • Includes the cost of interrupt, packet reordering, and message reassembly.

$$\text{Latency} = \text{Sending overhead} + \text{Time of flight} + \frac{\text{Packet size}}{\text{Bandwidth}} + \text{Receiving overhead}$$
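The latency expression can be evaluated directly. The function name, parameter names, and the sample numbers below are illustrative assumptions; any consistent set of units works (here seconds, bits, and bits per second).

```python
def total_latency(sending_overhead, time_of_flight, packet_size, bandwidth,
                  receiving_overhead):
    """Latency = sending overhead + time of flight + packet size / bandwidth + receiving overhead."""
    transmission_time = packet_size / bandwidth
    return sending_overhead + time_of_flight + transmission_time + receiving_overhead
```

For instance, a 1000-bit packet on a 1 Gbit/s link has a transmission time of 1 µs; with assumed overheads of 1 µs (sending) and 2 µs (receiving) and a 0.5 µs time of flight, the total is 4.5 µs, most of which is overhead rather than wire time.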

SLIDE 45

Terms and Definitions (cont’d)

Effective bandwidth with link pipelining

  • Pipeline the flight and transmission of packets over the links.
  • Overlap the sending overhead with the transport latency and receiving overhead of prior packets.

SLIDE 46

Terms and Definitions (cont’d)

Effective bandwidth with link pipelining

  • Pipeline the flight and transmission of packets over the links.
  • Overlap the sending overhead with the transport latency and receiving overhead of prior packets.

$$BW_{\text{LinkInjection}} = \frac{\text{Packet size}}{\max(\text{sending overhead},\ \text{transmission time})}$$

$$BW_{\text{LinkReception}} = \frac{\text{Packet size}}{\max(\text{receiving overhead},\ \text{transmission time})}$$

$$\text{Eff. bandwidth} = \min(2 \times BW_{\text{LinkInjection}},\ 2 \times BW_{\text{LinkReception}}) = \frac{2 \times \text{Packet size}}{\max(\text{overhead},\ \text{transmission time})} \quad \text{(only two devices)}$$

where overhead = max(sending overhead, receiving overhead).
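The two-device effective bandwidth can be computed from the same quantities. This is a small sketch with hypothetical function and parameter names; the sample values are made up for illustration.

```python
def effective_bandwidth(packet_size, bandwidth, sending_overhead, receiving_overhead):
    """Effective bandwidth of two devices with link pipelining.

    Each direction is limited by the slower of its overhead and the transmission time.
    """
    transmission_time = packet_size / bandwidth
    bw_injection = packet_size / max(sending_overhead, transmission_time)
    bw_reception = packet_size / max(receiving_overhead, transmission_time)
    return min(2 * bw_injection, 2 * bw_reception)
```

With a 1000-bit packet, a 1 Gbit/s link, and assumed overheads of 2 µs (sending) and 4 µs (receiving), the receiving overhead dominates and the result is 2 × 1000 bits / 4 µs = 0.5 Gbit/s, well below the raw link rate.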

SLIDE 47

Shared-Medium Networks Summary

  • The network media is shared by all the devices.
  • Arbitration:
      • Centralized arbiter for smaller distances between devices.
          • Dedicated control lines.
      • Distributed forms of arbiters.
          • CSMA/CD: Carrier Sense Multiple Access with Collision Detection.
              • The device first checks the network (carrier sensing).
              • Then it checks whether the data sent was garbled (collision detection).
              • On a collision, it retransmits after waiting a random amount of time that grows exponentially with each collision.
              • Fairness is not guaranteed.
          • Token ring: provides fairness.
              • Owning the token provides permission to use the network media.

[Figure: nodes attached to a shared medium; one node is the token holder.]
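The exponentially growing random wait after a collision can be sketched as a truncated binary exponential backoff. The slide does not give the exact rule; the window of 2^n slots after the n-th collision and the cap at an exponent of 10 follow the classic Ethernet scheme and are stated here as assumptions, as are the names.

```python
import random

def backoff_slots(collisions, rng=random, max_exponent=10):
    """Pick a random wait, in slot times, after the given number of consecutive collisions.

    The window doubles with every collision (assumed cap: 2**max_exponent slots).
    """
    exponent = min(collisions, max_exponent)
    return rng.randrange(2 ** exponent)  # uniform in [0, 2**exponent - 1]
```

Because each device draws its wait independently, an unlucky device can keep losing, which is why fairness is not guaranteed.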

SLIDE 48

Shared-Medium Networks Summary (cont’d)

  • Switching is straightforward: the granted device connects to the shared media.
  • Routing is straightforward:
      • Performed at all the potential destinations.
      • Each end node device checks whether it is the target of the packet.
  • Broadcast and multicast are easy to implement.
      • Every end node device sees the data sent on the shared link anyway.
  • Established order: arbitration, switching, and then routing.

SLIDE 49

Shared-Medium Networks Summary (cont’d)

Advantages:

  • Simple topology.
  • Low area cost.
  • Easy to build.
  • Efficient to implement.

Disadvantages:

  • Larger load per data bus line.
  • Longer delay for data transfer.
  • Larger energy consumption, since every data transfer is broadcast.
  • Lower bandwidth.
      • Cannot be solved by using a low-voltage-swing signaling technique.
  • Scalability is seriously limited.

Convenient for current SoCs that integrate fewer than 5 processors and rarely more than 10 bus masters.

SLIDE 50

On-Chip Interconnection Networks

  • Shared-Medium Networks
  • Switched-Media Networks (Direct and Indirect Networks)
  • Hybrid Networks

SLIDE 51

Switched-Media Networks

  • Disjoint portions of the media are shared via switching.
  • Switch fabric components:
      • Passive point-to-point links.
      • Active switches: dynamically establish communication between sets of source-destination pairs.
  • Aggregate bandwidth can be many times higher than that of shared-media networks.

[Figure: four nodes connected through a switch fabric.]

SLIDE 52

Switched-Media Networks (cont’d)

  • Routing
      • Every time a packet enters the network, it is routed.
  • Arbitration
      • Centralized or distributed.
      • Resolves conflicts among concurrent requests.
  • Switching
      • Once conflicts are resolved, the network “switches in” the required connections.
  • Established order: routing, arbitration, and then switching.

SLIDE 53

Shared- vs. Switched-Media Networks

Shared-media networks

  • Low cost.
  • Aggregate network bandwidth does not scale with the number of devices.
  • Global arbitration scheme required (a possible bottleneck).
  • Time of flight increases with the number of end nodes.

Switched-media networks

  • Aggregate network bandwidth scales with the number of devices.
  • Concurrent communication.
      • Potentially much higher network effective bandwidth.
  • Beware: inefficient designs are quite possible.
      • Superlinear network cost but sublinear network effective bandwidth.

SLIDE 54

Distributed Switched (Direct) Networks

  • Direct or point-to-point networks: as the number of nodes in the system increases, the total communication bandwidth also increases.
      • Overcomes the scalability problems.
  • Popular for building large-scale systems.
      • Each node is directly connected to a subset of the other nodes in the network (its neighboring nodes).
      • Nodes are on-chip computational units; each contains a network interface block (router), which handles communication-related tasks.
      • Each router is directly connected to the routers of the neighboring nodes.
  • More energy efficient than shared-medium networks, since the energy per transfer on a point-to-point communication channel is smaller than that on a large shared-medium architecture.
      • The energy for several point-to-point links should be considered.

SLIDE 55

Direct Network Example: RAW architecture

  • A fully programmable (tiles = PEs, interconnects) SoC, consisting of an array of identical computational tiles with local storage.
  • To accomplish programmable communication, each tile has a programmable router (switch processor).
  • RAW can be viewed as a direct network.

[Figure: array of 16 tiles (Tile 0 to Tile 15), each containing a RISC processor and a router.]

[Figure: router μarchitecture. Input buffers (West, South, East, North, Local) feed a 5 × 5 crossbar with West, South, East, North, and Local outputs; a scheduler handles request and grant signals and drives the select and config inputs.]

SLIDE 56

Distributed Switched (Direct) Networks

Fully-connected network: all nodes are directly connected to all other nodes using bidirectional dedicated links.

No advantage over a crossbar.

Bidirectional Ring networks:

N switches (3 × 3) and N bidirectional network links

Simultaneous packet transport over disjoint paths

Packets must hop across intermediate nodes

Shortest direction usually selected (N/4 hops, on average)
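The N/4 average hop count quoted for the bidirectional ring can be checked by brute force. The following is a minimal sketch, assuming nodes numbered 0 to N-1 and uniformly distributed destinations; the function names are illustrative only. It also contrasts the link count of a ring with that of a fully connected network.

```python
def ring_hops(src: int, dst: int, n: int) -> int:
    """Hops between two nodes of an n-node bidirectional ring, shortest direction."""
    clockwise = (dst - src) % n
    return min(clockwise, n - clockwise)


def average_hops(n: int) -> float:
    """Average shortest-path hop count from node 0 to every other node."""
    total = sum(ring_hops(0, dst, n) for dst in range(1, n))
    return total / (n - 1)


def ring_links(n: int) -> int:
    """A ring of n nodes uses n bidirectional links."""
    return n


def fully_connected_links(n: int) -> int:
    """Every pair of nodes gets its own bidirectional link."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    for n in (8, 16, 64):
        print(f"N={n:2d}: average hops {average_hops(n):.2f} (N/4 = {n / 4:.2f}), "
              f"ring links {ring_links(n)}, "
              f"fully connected links {fully_connected_links(n)}")
```

For even N the exact average over the other N-1 nodes is N²/(4(N-1)), which approaches N/4 as N grows.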

slide-57
SLIDE 57

Distributed Switched (Direct) Networks

Bidirectional Ring networks (folded):

N switches (3 × 3) and N bidirectional network links

Simultaneous packet transport over disjoint paths

Packets must hop across intermediate nodes

Shortest direction usually selected (N/4 hops, on average)

Folded ring: lower maximum physical link length

slide-58
SLIDE 58

Distributed Switched (Direct) Networks

Fully connected and ring topologies: the two extremes

The ideal topology:

Cost approaching a ring

Performance approaching a fully connected (crossbar) topology

More practical topologies:

k-ary n-cubes (meshes, tori, hypercubes)

k nodes connected in each dimension, with n total dimensions

Symmetry and regularity:

network implementation is simplified

routing is simplified
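To make the k-ary n-cube definition concrete, the sketch below enumerates the nodes as n-digit radix-k coordinates and lists the neighbors of a node. It is only an illustration under stated assumptions: the `wraparound` flag (torus when true, mesh when false) and the function names are chosen for this example and are not taken from the lecture.

```python
from itertools import product


def nodes(k: int, n: int) -> list:
    """All k**n nodes of a k-ary n-cube, as tuples of n coordinates in 0..k-1."""
    return list(product(range(k), repeat=n))


def neighbors(node: tuple, k: int, wraparound: bool = True) -> list:
    """Nodes one hop away: +/-1 in exactly one dimension.

    wraparound=True gives a torus; wraparound=False gives a mesh,
    where nodes on the edge have fewer neighbors.
    """
    found = set()
    for dim, coord in enumerate(node):
        for step in (-1, 1):
            c = coord + step
            if wraparound:
                c %= k
            elif not 0 <= c < k:
                continue  # would fall off the edge of the mesh
            candidate = node[:dim] + (c,) + node[dim + 1:]
            if candidate != node:
                found.add(candidate)  # the set removes duplicates when k == 2
    return sorted(found)


if __name__ == "__main__":
    print(len(nodes(4, 2)), "nodes in a 4-ary 2-cube")
    print("torus neighbors of (0, 0):", neighbors((0, 0), 4, wraparound=True))
    print("mesh neighbors of (0, 0): ", neighbors((0, 0), 4, wraparound=False))
    print("hypercube (2-ary 3-cube) neighbors of (0, 0, 0):", neighbors((0, 0, 0), 2))
```

Because every node applies the same rule, the neighbor computation is identical everywhere, which is one way to see why the regularity simplifies both implementation and routing.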

slide-59
SLIDE 59

Centralized Switched (Indirect) Networks

A connection between nodes has to go through a set of switches.

The network adapter associated with each node connects to a port of a switch.

Switches only provide a programmable connection between their ports; i.e., set up a communication path that can be changed over time.

Distinction between direct and indirect networks is blurring, since routers and switches are getting more complex and absorb each other’s functionality.

Reconfigurable micronetworks exploit programmable routers/switches.

Use multiplexers whose control signals are set by configuration bits in local storage, as in the case of FPGAs.

Interface circuitry and network control policies must be kept extremely simple for FPGAs, but can be much more complex when supporting coarser-grain information transfers.

2 topologies:

Crossbar network

Multistage interconnection networks (MINs)
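The idea of a switch built from multiplexers whose select signals come from stored configuration bits can be sketched in a few lines. This is a behavioral toy model under assumptions made for the example: the class name `ConfigurableSwitch`, its methods, and the use of `None` for an undriven output are all hypothetical and do not describe any particular FPGA or router.

```python
class ConfigurableSwitch:
    """One multiplexer per output port; config[out] selects the driving input."""

    def __init__(self, ports: int):
        self.ports = ports
        # Configuration storage: None means the output is not driven.
        self.config = [None] * ports

    def connect(self, in_port: int, out_port: int) -> None:
        """Program the multiplexer of out_port to select in_port."""
        if not (0 <= in_port < self.ports and 0 <= out_port < self.ports):
            raise ValueError("port index out of range")
        self.config[out_port] = in_port

    def forward(self, inputs: list) -> list:
        """Return the value seen on each output for the given input values."""
        if len(inputs) != self.ports:
            raise ValueError("expected one value per input port")
        return [None if sel is None else inputs[sel] for sel in self.config]


if __name__ == "__main__":
    sw = ConfigurableSwitch(4)
    sw.connect(0, 2)   # input 0 drives output 2
    sw.connect(3, 0)   # input 3 drives output 0
    print(sw.forward(["a", "b", "c", "d"]))
    sw.connect(1, 2)   # the path can be changed over time by rewriting the config
    print(sw.forward(["a", "b", "c", "d"]))
```

The switch itself makes no routing decisions; all behavior is determined by what was written into `config`, which mirrors the statement that switches only provide a programmable connection between their ports.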

slide-60
SLIDE 60

Examples

Xilinx SpartanII FPGA: CLBs are connected via a hierarchy of routing channels.

Thus each chip has an indirect network over a homogeneous fabric.

Xilinx VirtexII FPGAs: various configurable elements (CLBs, RAMs, multipliers, …).

Programmable interconnection is achieved by routing switches.

VirtexII can be seen as an indirect network over a heterogeneous fabric.

slide-61
SLIDE 61

Crossbar Network

Crosspoint switch complexity increases quadratically with the number of crossbar input/output ports, N, i.e., grows as O(N²)

Has the property of being non-blocking

slide-62
SLIDE 62

Multistage Interconnection Networks (MINs)

Crossbar split into several stages consisting of smaller crossbars

Complexity grows as O(N × log N), where N is # of end nodes

Reduction in MIN switch cost comes at the price of performance

Network has the property of being blocking

Contention is more likely to occur on network links

Paths from different sources to different destinations share one or more links

Figure: Omega topology, perfect-shuffle exchange
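The O(N²) versus O(N × log N) cost comparison can be made tangible with a small calculation. The sketch below assumes an Omega network built from 2 × 2 switches, with log₂N stages of N/2 switches each and 4 crosspoints per switch; the switch size is an assumption for the example, since the slide only says "smaller crossbars".

```python
def crossbar_crosspoints(n: int) -> int:
    """An n x n crossbar needs one crosspoint per input/output pair."""
    return n * n


def omega_stage_count(n: int) -> int:
    """Number of stages of an Omega network with n ports (n a power of two)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    return n.bit_length() - 1  # log2(n)


def omega_crosspoints(n: int) -> int:
    """Crosspoints in an Omega network made of 2 x 2 switches (assumed size)."""
    switches = omega_stage_count(n) * (n // 2)
    return switches * 4  # each 2 x 2 switch has 4 crosspoints


if __name__ == "__main__":
    for n in (8, 64, 1024):
        print(f"N={n:4d}: crossbar {crossbar_crosspoints(n):7d} crosspoints, "
              f"Omega MIN {omega_crosspoints(n):5d} crosspoints")
```

Under these assumptions the saving is modest at N = 8 (64 versus 48 crosspoints) and grows quickly with N, which is the cost argument for accepting a blocking network.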

slide-63
SLIDE 63

Blocking

Figure: blocking topology vs. non-blocking topology

How to reduce blocking in MINs? Provide alternative paths!

Use larger switches (can equate to using more switches)

Use more switches
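Blocking in a MIN can be demonstrated with a short simulation. The sketch below models an Omega network with N = 2^n ports and 2 × 2 switches, using destination-tag routing: each stage applies a perfect shuffle (rotate the line address left by one bit) and the switch then sets the lowest bit to the next destination bit, most significant bit first. The routing scheme and the function names are assumptions made for this example; the slides name the topology but do not describe the routing.

```python
def omega_path(src: int, dst: int, n_bits: int) -> list:
    """Output line occupied after each stage when routing src -> dst."""
    size = 1 << n_bits
    pos = src
    path = []
    for stage in range(n_bits):
        # Perfect shuffle: rotate the n-bit line address left by one.
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & (size - 1)
        # Exchange: the 2 x 2 switch picks its upper (0) or lower (1) output
        # according to the destination bit for this stage, MSB first.
        dest_bit = (dst >> (n_bits - 1 - stage)) & 1
        pos = (pos & ~1) | dest_bit
        path.append(pos)
    return path


def conflicts(route_a: tuple, route_b: tuple, n_bits: int) -> bool:
    """True if two (src, dst) routes need the same link at the same stage."""
    path_a = omega_path(*route_a, n_bits)
    path_b = omega_path(*route_b, n_bits)
    return any(a == b for a, b in zip(path_a, path_b))


if __name__ == "__main__":
    # Different sources and different destinations, yet a shared link.
    print("0->0:", omega_path(0, 0, 3), " 4->1:", omega_path(4, 1, 3))
    print("blocked:", conflicts((0, 0), (4, 1), 3))
    print("0->0 together with 1->7 blocked:", conflicts((0, 0), (1, 7), 3))
```

In this model there is exactly one path per source/destination pair, so a conflict cannot be routed around; adding stages or using larger switches is what creates the alternative paths mentioned on the slide.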

slide-64
SLIDE 64

Comparison of Indirect and Direct Networks

Figure: comparison of indirect and direct networks (legend: End Nodes, Switches)

slide-65
SLIDE 65

On-Chip Interconnection Networks

Shared-Medium Networks

Switched-Media Networks (Direct and Indirect Networks)

Hybrid Networks

slide-66
SLIDE 66

Hybrid Networks

Advantages of homogeneous interconnection architectures:

Facilitate modular design,

Easily scaled up by replication.

Suitable for general-purpose computing.

However, they are an obstacle to flexibility and fine tuning of architectures to application characteristics.

Systems developed for a particular application can benefit from a more heterogeneous communication infrastructure.

Provides high bandwidth in a localized fashion only where it is needed to eliminate bottlenecks.

Hence, heterogeneous, or hybrid, interconnection architectures.

Energy efficiency is a strong driver toward hybrid architectures.

Examples: multiple-backplane and hierarchical (or bridged) busses.

3 busses in AMBA.