DUAL CONGESTION AWARENESS SCHEME IN ON-CHIP NETWORKS
SLIDE 1

DEPARTMENT OF INFORMATION TECHNOLOGY

Masoumeh Ebrahimi, Masoud Daneshtalab, Juha Plosila, Hannu Tenhunen

DUAL CONGESTION AWARENESS SCHEME IN ON-CHIP NETWORKS

SLIDE 2

MANY-CORE EMBEDDED SYSTEMS

[Figure: mesh of 16 tiles; each tile contains a Core, a Network Interface (NI), and a Router (R)]

[Figure: tile internals; the Router (R) has input buffers, a routing unit, an arbiter, and a crossbar driving the output ports; the Network Interface (NI) contains a packetizer, a depacketizer, and a controller; the PE couples a processor with local memory]

SLIDE 3

MICRO-SLICE SERVERS

Tilera S2Q

SLIDE 4

MICRO-SLICE SERVERS

Tilera S2Q

4U chassis; each U holds 2×64 cores, for 512 cores in total


SLIDE 6

AXIM: AXI-BASED MANY-CORE EMBEDDED SYSTEM

[Figure: AXIM on FPGA; an AXI bus connects an embedded controller (MicroBlaze + Linux with Ethernet, DDR memory, and I/O controllers, an Ethernet PHY, DDR memory, and CompactFlash), a PCI-Express controller with its PHY, and, via AXI interfaces, one master and twelve slave Plasma + Linux processing elements]

Prototype layout on a Virtex-6 FPGA with 25 MIPS-like PEs (ICCD 2012, FPL 2012, ReCoSoC 2012)

[Figure: example application task graph with tasks t0–t5 and edge weights w0,1 = 5, w0,2 = 4, w2,3 = 3, w1,4 = 6, w2,4 = 7, w3,5 = 4, w4,5 = 6]

BIO-based applications:

  • SAXPY/GAXPY
  • Cholesky
  • Jacobi
  • QR factorization

Web-server accelerator

Streaming applications: H.264, transcoder

SLIDE 7

AXIM: AXI-BASED MANY-CORE EMBEDDED SYSTEM


[Figure: AXIM software stack; the embedded controller runs the OS with user and system processes; the master PE runs a micro-kernel hosting the task-mapping algorithm; each slave PE runs a micro-kernel executing mapped application tasks (e.g., App 1 Task 1 and App 2 Tasks 1–2 on slave PE 1); PEs communicate over the NoC; DRAM and flash memory hold the data sets and an application repository; PEs are organized into clusters]

SLIDE 8

AXIM: AXI-BASED MANY-CORE EMBEDDED SYSTEM


NETWORK-ON-CHIP

SLIDE 9

CONGESTION IN NETWORKS-ON-CHIP

  • Congestion is one of the main factors limiting the performance of Networks-on-Chip.
  • Routing algorithms play an important role in distributing the traffic load over the network by providing alternative routing paths.
  • However, most research works avoid congestion by considering the traffic condition only in the forward paths and delivering packets through less congested paths; they do not consider the impact of router arbitration on traffic distribution.
  • Traditional methods usually consider traffic at the node level rather than at the region level.

SLIDE 10

INPUT SELECTION FUNCTIONS

The performance and efficiency of NoCs largely depend on the output-selection and input-selection functions.

  • The input selection function chooses which input channel gets access to the output channel; this is done by an arbitration process.
  • The arbiter can follow either a non-priority or a priority scheme.
  • In the priority scheme, when multiple input ports request the same available output port, the arbiter grants access to the input port with the highest priority level.
  • The priority scheme can help flatten network congestion by giving a higher priority level to traffic coming from congested areas.
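The priority-based input selection above can be sketched in a few lines of Python; this is an illustrative model only (the port names and priority values are assumptions, not part of the slides, and the real arbiter is hardware):

```python
# Sketch of a priority-based input selection (arbitration).
# Each request carries a congestion-derived priority value:
# a higher value means the packet comes from a more congested area.

def arbitrate(requests):
    """Grant the output port to the highest-priority input.

    `requests` maps input-port name -> priority value.
    Returns the granted input port, or None if there are no requests.
    """
    if not requests:
        return None
    # The winner is the request with the highest priority level.
    return max(requests, key=requests.get)

# Example: the packet on the 'west' input comes from the most
# congested area, so it wins arbitration.
granted = arbitrate({"north": 2, "west": 5, "local": 1})  # -> "west"
```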

SLIDE 11

OUTPUT SELECTION FUNCTIONS

  • The output selection function determines which output channel should be chosen for a packet arriving from an input channel, using a routing algorithm.
  • Output selection functions can be classified as either congestion-oblivious or congestion-aware.
  • Congestion-oblivious schemes (static/deterministic):
  • Decisions are independent of the congestion condition of the network; dimension-order routing is one example.
  • This policy cannot balance the load, since the network status is not considered.
  • Congestion-aware schemes (dynamic/adaptive):
  • They usually consider the local traffic condition to choose the output channel.
  • Routing decisions based only on local congestion information may lead to an unbalanced distribution of the traffic load.
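Dimension-order (XY) routing, the congestion-oblivious example above, can be sketched as follows; the coordinate and direction conventions are assumptions for illustration:

```python
# Minimal sketch of dimension-order (XY) routing: resolve the X
# dimension completely first, then the Y dimension, ignoring the
# network's congestion status entirely.

def xy_route(cur, dst):
    """Return the output direction for a packet at `cur` heading to `dst`.

    Coordinates are (x, y) tuples on a 2D mesh; directions are
    E/W/N/S, or LOCAL when the packet has arrived.
    """
    cx, cy = cur
    dx, dy = dst
    if cx != dx:                      # first resolve the X dimension
        return "E" if dx > cx else "W"
    if cy != dy:                      # then resolve the Y dimension
        return "N" if dy > cy else "S"
    return "LOCAL"                    # arrived: eject to the local core
```

Because the same source/destination pair always yields the same path, no alternative route is ever taken, which is why this policy cannot balance the load.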

SLIDE 12

TRADITIONAL METHODS: ANOC

  • ANoC: Congestion-Aware Selection Method in Agent-based Network-on-Chip (VLSI-SoC 2011)
  • In the Agent-based Network-on-Chip (ANoC) structure, the network is divided into several clusters, each containing four routers and a cluster agent, with separate data and congestion networks.
  • Advantages and shortcomings:
  • It provides a wider view of network congestion by considering groups of nodes (regions) rather than single nodes.
  • It only uses the congestion condition of forward paths in the routing decision.

SLIDE 13

TRADITIONAL METHODS: CATRA

  • CATRA: Congestion-Aware Trapezoid-based Routing Algorithm (DATE 2012)
  • CATRA tries to collect and utilize congestion information for just enough nodes in the network.
  • In CATRA, the probability of packets passing through each node is calculated; based on this measurement, the nodes with high passing probabilities form trapezoid-shaped regions.
  • Advantages and shortcomings:
  • It efficiently chooses the groups of nodes, determined by the probability of packets passing through them.
  • It only involves the congestion status of forward paths in the routing decision.

SLIDE 14

TRADITIONAL METHODS: GLB

  • GLB: Global Load Balancing Method (ReCoSoC 2012)
  • Packets aggregate and carry congestion information along the path they travel; this information gives a global view of the path the packet has taken.
  • This global congestion information is used as a metric in the arbitration unit, giving higher priority to packets arriving from congested areas.
  • Advantages and shortcomings:
  • It provides global congestion information.
  • This method pays more attention to the congestion information of backward paths, with less attention to forward paths.

[Figure: mesh of 24 numbered nodes divided into quadrants (SW, SE, NW, NE), with per-quadrant congestion values A–D]
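One plausible way for a GLB-style packet to aggregate congestion information hop by hop is a running average of buffer occupancy along the path; the averaging scheme and field names here are assumptions for illustration, not GLB's actual format:

```python
# Hypothetical sketch of hop-by-hop congestion aggregation: each router
# folds its local buffer occupancy into the metric the packet carries,
# so the packet arrives with a path-wide view of backward congestion.

def update_congestion(carried, local_occupancy, hops):
    """Fold the local buffer occupancy into the running path average.

    carried:         average occupancy over the `hops` routers crossed so far
    local_occupancy: occupied buffer cells at the current router
    hops:            number of routers already crossed
    """
    return (carried * hops + local_occupancy) / (hops + 1)

# A packet that has crossed 3 routers with an average occupancy of 4
# now traverses a router whose buffers hold 8 flits:
new_metric = update_congestion(4.0, 8, 3)   # -> 5.0
```

The arbiter at each downstream router can then compare these carried metrics, giving priority to the packet whose backward path is most congested.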

SLIDE 15

SUMMARY ON TRADITIONAL METHODS

  • ANoC and CATRA are region-based methodologies that consider only forward paths.
  • GLB considers backward paths and relies on the global congestion status.
  • Characteristics of DuCA (Dual Congestion Awareness):
  • It is a region-based approach, providing a wider view of network traffic.
  • It is a scalable approach that does not use tables.
  • It considers the congestion information of both forward and backward paths in the routing decision (output selection) and in the arbitration unit (input selection).

SLIDE 16

INPUT SELECTION FUNCTION OF DUCA

  • The input selection function examines the priority values of competing packets and gives priority to packets coming from congested clusters.
  • To prevent starvation, each time the highest-priority packet is selected, the priorities of the defeated packets are incremented.

[Figure: 6×6 mesh of numbered nodes grouped into eight clusters]
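The starvation-prevention rule above can be sketched as follows; the tie-breaking behavior and the fact that a winner keeps its priority are assumptions beyond what the slide states:

```python
# Sketch of DuCA-style starvation prevention: after each grant, the
# priorities of the defeated requests are incremented, so every waiting
# packet eventually reaches the highest priority and wins.

def arbitrate_fair(priorities):
    """Pick the highest-priority input port, then bump the losers.

    `priorities` maps input port -> current priority; it is mutated
    in place. Returns the granted port.
    """
    winner = max(priorities, key=priorities.get)
    for port in priorities:
        if port != winner:
            priorities[port] += 1      # defeated packets gain priority
    return winner

prio = {"north": 5, "west": 1}
arbitrate_fair(prio)   # 'north' wins; 'west' rises to 2
```

Even though 'west' starts far behind, its priority rises on every lost round, so it is guaranteed to be granted within a bounded number of arbitrations.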

SLIDE 17

OUTPUT SELECTION FUNCTION OF DUCA

  • The output selection function chooses either the X or the Y dimension based on the congestion condition of the neighboring clusters.

[Figure: 6×6 mesh of numbered nodes grouped into eight clusters]

  • If the destination is located in the same row or column as the neighboring cluster, only the number of occupied buffer cells at the corresponding input buffers of the neighboring nodes is considered.
  • Otherwise, the congestion levels of the two neighboring clusters are compared with each other.
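The two-case output selection above can be sketched as a small decision function; the parameter names and the tie-breaking toward X are illustrative assumptions:

```python
# Hedged sketch of DuCA's output selection: choose the X or Y dimension
# from cluster-level congestion, falling back to neighbor buffer
# occupancy when the destination shares a row or column with a
# neighboring cluster.

def select_output(dest_aligned, buf_x, buf_y, cluster_x, cluster_y):
    """Return 'X' or 'Y', the less congested dimension.

    dest_aligned: True if the destination lies in the same row or
        column as the neighboring cluster (then only buffer
        occupancy at the neighboring nodes is used).
    buf_x, buf_y: occupied buffer cells at the X/Y neighbors' inputs.
    cluster_x, cluster_y: congestion levels of the two neighboring
        clusters (used in the general case).
    """
    if dest_aligned:
        return "X" if buf_x <= buf_y else "Y"
    return "X" if cluster_x <= cluster_y else "Y"
```

Used together with the input selection function, this gives DuCA its dual awareness: the output side reacts to forward-path congestion, while the arbitration side reacts to backward-path congestion.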


SLIDE 19

EXPERIMENTAL RESULTS

  • A NoC simulator is implemented in VHDL.
  • The simulator models all major components of the NoC, such as network interfaces, routers, and wires.
  • The ANoC and GLB approaches are used as the baseline models.
  • System configuration:
  • A 2D mesh in which eighteen nodes are processors and eighteen nodes are memories.
  • For all routers, the data width (flit size) is set to 32 bits and the buffer depth of each channel is set to 5 flits.

[Figure: 6×6 mesh; each tile is a processor (Proc) or a memory (Mem) connected to a router (R) through a network interface (NI), with eighteen tiles of each type]

SLIDE 20

Performance Evaluation

  • We consider uniform and non-uniform (localized) synthetic traffic patterns.
  • In uniform traffic, each processor sends read/write requests to the memories with uniform probability; the target memory and the request type (read or write) are selected randomly.
  • In the non-uniform mode, 70% of the traffic consists of local requests, whose destination memory is one hop away from the master core; the remaining 30% of the traffic is uniformly distributed over the non-local memory modules.
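The two traffic generators above can be sketched as one selection function; the helper structure (lists of memory identifiers, a one-hop set) is an assumption for illustration:

```python
# Sketch of the synthetic traffic described above: uniform traffic picks
# any memory with equal probability; localized traffic sends 70% of
# requests to a one-hop memory and 30% uniformly to non-local memories.

import random

def pick_target(memories, one_hop, localized, rng=random):
    """Choose a destination memory and a request type for one request.

    memories:  all memory modules reachable from this processor
    one_hop:   the subset of memories one hop away
    localized: False for uniform traffic, True for the non-uniform mode
    """
    req_type = rng.choice(["read", "write"])     # read/write chosen randomly
    if not localized:
        target = rng.choice(memories)            # uniform over all memories
    elif rng.random() < 0.7:
        target = rng.choice(one_hop)             # 70%: local, one hop away
    else:
        non_local = [m for m in memories if m not in one_hop]
        target = rng.choice(non_local)           # 30%: uniform, non-local
    return target, req_type
```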


SLIDE 22

Performance Evaluation

  • Compared with the GLB and ANoC architectures, the NoC using DuCA reduces the average latency as the request rate increases, under both the uniform and non-uniform traffic models.
  • The performance gain near the saturation point (0.6) is about 17%–35% under uniform traffic and 16%–38% under non-uniform traffic.

[Figure: average latency (cycles) versus request rate (fraction of capacity) for GLB, DuCA, and ANoC under (a) uniform traffic and (b) non-uniform traffic]

SLIDE 23

Physical Implementation

  • DuCA, ANoC, and GLB are synthesized with Synopsys Design Compiler using a 65 nm standard CMOS technology.
  • A timing constraint of 1 GHz is set for the system clock, with a supply voltage of 1 V.
  • The power dissipation of each scheme is calculated under the uniform traffic model near the saturation point (in a 6×6 2D mesh) using Synopsys PrimePower.

Router       | Area (mm²) | Power (mW)
GLB Router   | 0.1873     | 59
ANoC Router  | 0.1913     | 66
DuCA Router  | 0.1954     | 69

SLIDE 24

Conclusion

  • Network congestion can limit NoC performance by increasing transmission latency.
  • Most recent congestion-aware techniques aim to alleviate congestion by delivering packets along less congested paths.
  • Almost none of them take information about backward paths into consideration.
  • DuCA utilizes the congestion status of both forward and backward paths in the routing decision.
  • DuCA provides a better view of network congestion by considering traffic at the region level rather than the node level.
  • Experimental results demonstrate that DuCA provides a significant improvement in average network latency compared with traditional methods.