DUAL CONGESTION AWARENESS SCHEME IN ON-CHIP NETWORKS
Department of Information Technology
Masoumeh Ebrahimi, Masoud Daneshtalab, Juha Plosila, Hannu Tenhunen
MANY-CORE EMBEDDED SYSTEMS
[Figure: 4×4 mesh of tiles; each tile contains a Core attached through a Network Interface (NI) to a Router (R)]
[Figure: Router (R) internals with input/output channels, a routing unit, an arbiter, and a crossbar; the Network Interface (NI) with packetizer/depacketizer and controller connects the router to the PE's processor and memory]
MICRO-SLICE SERVERS
- Tilera S2Q: 4U, each U with 2×64 cores, 512 cores in total
AXIM: AXI-BASED MANY-CORE EMBEDDED SYSTEM
[Figure: AXIM on FPGA: an embedded controller (MicroBlaze + Linux) with Ethernet, DDR memory, I/O (Compact FLASH), and PCI-Express controllers; thirteen Plasma + Linux PEs (one master, twelve slaves) attached to the AXI buses through AXI interfaces]
Prototype layout on a Virtex-6 FPGA with 25 MIPS-like PEs (ICCD 2012, FPL 2012, ReCoSoC 2012)
[Figure: application task graph with tasks t0–t5 and edge weights w0,1 = 5, w0,2 = 4, w1,4 = 6, w2,3 = 3, w2,4 = 7, w3,5 = 4, w4,5 = 6]
- BLAS-based applications: SAXPY/GAXPY, Cholesky, Jacobi, QR factorization
- Web-server accelerator
- Streaming applications: H.264, transcoder
OS STRUCTURE
[Figure: the embedded controller runs an OS with user and system processes; the master PE's micro-kernel hosts the task mapping algorithm; slave PEs 1..N run micro-kernels executing application tasks (e.g., App 1 – Task 1, App 2 – Task 2); the cluster of PEs is connected through the communication infrastructure (NoC) to memory (DRAM & FLASH) holding data sets 1..M and an application repository with applications 1..K]
NETWORK-ON-CHIP
CONGESTION IN NETWORKS-ON-CHIP
- Congestion is one of the main factors limiting the performance of Networks-on-Chip.
- Routing algorithms play an important role in distributing the traffic load over the network by providing alternative routing paths. However:
- Most research works avoid congestion by considering the traffic condition of the forward paths and delivering packets through less congested paths; they do not consider the impact of router arbitration on traffic distribution.
- Traditional methods usually consider traffic at the node level rather than the region level.
INPUT SELECTION FUNCTIONS
The performance and efficiency of NoCs largely depend on the input-selection and output-selection functions.
- Input selection function
- Chooses which of the input channels gets access to an output channel, through an arbitration process.
- The arbiter may follow either a non-priority or a priority scheme.
- In the priority scheme, when multiple input ports request the same available output port, the arbiter grants access to the input port with the highest priority level.
- The priority scheme can help flatten network congestion by giving a higher priority level to traffic coming from congested areas.
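The priority scheme above can be sketched as a minimal arbiter model; the port names and priority values are illustrative, not taken from any specific router:

```python
def priority_arbiter(requests):
    """Grant an output port to one of several competing requests.

    `requests` is a list of (input_port, priority) pairs contending for
    the same output port; the request with the highest priority wins.
    Ties are broken by list order (an illustrative policy)."""
    winner = max(requests, key=lambda r: r[1])
    return winner[0]

# Requests from the North, West, and Local ports; the West packet,
# coming from the most congested area, carries the highest priority.
print(priority_arbiter([("N", 2), ("W", 5), ("L", 1)]))  # W
```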
OUTPUT SELECTION FUNCTIONS
- Output selection function
- Determines which output channel a packet arriving on an input channel should take, using a routing algorithm.
- Output selection functions can be classified as either congestion-oblivious or congestion-aware.
- Congestion-oblivious schemes (static/deterministic)
- Decisions are independent of the congestion condition of the network, as in dimension-order routing.
- This policy cannot distribute the load, since the network status is not considered.
- Congestion-aware schemes (dynamic/adaptive)
- Usually consider the local traffic condition when choosing the output channel.
- Routing decisions based only on local congestion information may lead to an unbalanced distribution of the traffic load.
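The two classes can be contrasted in a small sketch: dimension-order (XY) routing ignores network status, while a locally adaptive selection picks the productive direction whose neighboring input buffer holds the fewest flits. The coordinates and occupancy values are illustrative:

```python
def xy_route(cur, dst):
    """Congestion-oblivious dimension-order routing: always exhaust
    the X offset before turning into Y, ignoring network status."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:
        return "E" if dx > cx else "W"
    if cy != dy:
        return "N" if dy > cy else "S"
    return "Local"

def adaptive_route(cur, dst, occupancy):
    """Congestion-aware selection: among the minimal (productive)
    directions, pick the one whose neighboring input buffer holds
    the fewest flits."""
    (cx, cy), (dx, dy) = cur, dst
    options = []
    if dx != cx:
        options.append("E" if dx > cx else "W")
    if dy != cy:
        options.append("N" if dy > cy else "S")
    if not options:
        return "Local"
    return min(options, key=lambda d: occupancy[d])

occ = {"E": 4, "N": 1}                      # flits queued at each neighbor
print(xy_route((0, 0), (2, 2)))             # E  (fixed dimension order)
print(adaptive_route((0, 0), (2, 2), occ))  # N  (less congested neighbor)
```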
TRADITIONAL METHODS: ANOC
- ANoC: Congestion-Aware Selection Method in Agent-based Network-on-Chip (VLSI-SoC 2011)
- In the Agent-based Network-on-Chip (ANoC) structure, the network is divided into clusters, each containing four routers and a cluster agent; separate data and congestion networks are used.
- Advantages and shortcomings
- Provides a wider view of the network congestion by considering groups of nodes (regions) rather than single nodes.
- Uses only the congestion condition of forward paths in the routing decision.
TRADITIONAL METHODS: CATRA
- CATRA: Congestion-Aware Trapezoid-based Routing Algorithm (DATE 2012)
- CATRA collects and utilizes congestion information from just enough nodes in the network.
- In CATRA, the probability of packets passing through each node is calculated; based on this measurement, the nodes with high passing probabilities form a trapezoid-shaped region.
- Advantages and shortcomings
- Efficiently chooses the group of nodes, determined by the passing probability of packets through each node.
- Involves only the congestion status of forward paths in the routing decision.
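The notion of a passing probability can be illustrated for minimal routing on a mesh with a standard lattice-path count; this is a generic combinatorial sketch, not CATRA's exact formulation:

```python
from math import comb

def num_minimal_paths(a, b):
    """Number of minimal (XY-monotone) paths between two mesh nodes."""
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return comb(dx + dy, dx)

def passing_probability(src, dst, node):
    """Fraction of minimal src->dst paths that pass through `node`,
    assuming `node` lies inside the minimal-path rectangle and all
    minimal paths are equally likely."""
    through = num_minimal_paths(src, node) * num_minimal_paths(node, dst)
    return through / num_minimal_paths(src, dst)

# From (0,0) to (3,3) there are C(6,3) = 20 minimal paths; the centre
# node (1,1) lies on 12 of them, the corner (3,0) on only 1.
print(passing_probability((0, 0), (3, 3), (1, 1)))  # 0.6
print(passing_probability((0, 0), (3, 3), (3, 0)))  # 0.05
```

Nodes near the source–destination diagonal carry the highest passing probabilities, which is why they would be grouped into the congestion-monitored region.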
TRADITIONAL METHODS: GLB
- GLB: Global Load Balancing Method (ReCoSoC 2012)
- Packets aggregate and carry congestion information along the path they traverse; this information gives a global view of the path a packet has been routed over.
- This global congestion information is used as a metric in the arbitration unit to give higher priority to packets arriving from congested areas.
- Advantages and shortcomings
- Provides global congestion information.
- Pays more attention to the congestion information of backward paths, with less attention to forward paths.
[Figure: mesh with nodes numbered 1–24; each node holds a table of congestion values for the four quadrants (SW, SE, NW, NE), rows A–D]
SUMMARY ON TRADITIONAL METHODS
- ANoC and CATRA are region-based methodologies that consider only forward paths.
- GLB considers backward paths and relies on the global congestion status.
- Characteristics of DuCA (Dual Congestion Awareness):
- It is a region-based approach, providing a wider view of network traffic.
- It is a scalable approach that does not require tables.
- It considers the congestion information of both forward and backward paths in the routing decision (output selection) and the arbitration unit (input selection).
INPUT SELECTION FUNCTION OF DUCA
- The input selection function examines the priority values of competing packets and gives priority to packets coming from congested clusters.
- To prevent starvation, each time the highest-priority packet is granted, the priorities of the defeated packets are incremented.
[Figure: 6×6 mesh with snake-numbered nodes partitioned into eight clusters]
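The starvation-avoidance rule can be sketched as follows, assuming each input port carries an integer, congestion-derived priority (a minimal model, not the exact DuCA hardware):

```python
def duca_input_select(priorities):
    """One arbitration round of the input selection function.

    `priorities` maps each requesting input port to its current
    congestion-derived priority.  The highest-priority port wins; the
    defeated ports have their priorities incremented so that no packet
    can starve indefinitely (ties break by insertion order here)."""
    winner = max(priorities, key=priorities.get)
    updated = {p: (v if p == winner else v + 1)
               for p, v in priorities.items()}
    return winner, updated

reqs = {"N": 3, "S": 3, "E": 1}
winner, reqs = duca_input_select(reqs)
print(winner, reqs)  # N {'N': 3, 'S': 4, 'E': 2}
```

After enough losing rounds, even the East packet's priority overtakes the others, so it is eventually granted.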
OUTPUT SELECTION FUNCTION OF DUCA
- The output selection function chooses either the X or Y dimension based on the congestion condition of the neighboring clusters.
- If the destination is located in the same row or column as the neighboring cluster, only the number of occupied buffer cells at the corresponding input buffers of the neighboring nodes is considered.
- Otherwise, the congestion levels of the two neighboring clusters are compared with each other.
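The two-case decision can be sketched as below; the alignment flags, buffer occupancies, and cluster congestion values are illustrative inputs, and the per-dimension metric is one possible reading of the rule:

```python
def duca_output_select(dest_aligned, buf_occ, cluster_cong):
    """Choose the X or Y output channel for a packet.

    dest_aligned[d]:  True if the destination lies in the same row or
                      column as the neighboring cluster in dimension d.
    buf_occ[d]:       occupied buffer cells at the neighbor's input port.
    cluster_cong[d]:  congestion level of the neighboring cluster.

    When the destination is aligned with a neighboring cluster, only the
    neighbors' buffer occupancy is considered; otherwise the congestion
    levels of the two neighboring clusters are compared."""
    metric = {d: (buf_occ[d] if dest_aligned[d] else cluster_cong[d])
              for d in ("X", "Y")}
    return min(metric, key=metric.get)

# Destination aligned with neither cluster: compare cluster congestion,
# so the less congested Y cluster is chosen.
print(duca_output_select({"X": False, "Y": False},
                         {"X": 2, "Y": 4}, {"X": 9, "Y": 3}))  # Y
```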
EXPERIMENTAL RESULTS
- A NoC simulator is implemented in VHDL.
- The simulator models all major components of the NoC, such as network interfaces, routers, and wires.
- The ANoC and GLB approaches are used as our baseline models.
- System configuration:
- A 6×6 2D mesh with eighteen processor nodes and eighteen memory nodes.
- For all routers, the data width (flit size) is set to 32 bits and the buffer depth of each channel is set to 5 flits.
[Figure: 6×6 mesh platform; each tile has a Router (R) and a Network Interface (NI) attached to either a processor (Proc) or a memory (Mem), eighteen of each]
Performance Evaluation
- We consider uniform and non-uniform/localized
synthetic traffic patterns.
- In uniform traffic, each processor sends read/write requests to
memories with a uniform probability. Hence, the target memory and request type (read or write) are selected randomly.
- In the non-uniform mode, 70% of the traffic is local requests,
where the destination memory is one hop away from the master core, and the remaining 30% of the traffic is uniformly distributed to the non-local memory modules.
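The two traffic modes can be sketched with a simple destination generator; the memory identifiers and the partitioning into one-hop and remote memories are illustrative:

```python
import random

def pick_destination(local_mems, remote_mems, localized, p_local=0.7):
    """Return a (memory, request_type) pair for one processor's request.

    Uniform mode: the target memory and request type are chosen uniformly
    at random.  Localized mode: with probability `p_local` (70%) the
    destination is a memory one hop away (`local_mems`); otherwise it is
    drawn uniformly from the remaining memories (`remote_mems`)."""
    req = random.choice(["read", "write"])          # read/write chosen randomly
    if localized and random.random() < p_local:
        mem = random.choice(local_mems)             # one-hop neighbor memory
    elif localized:
        mem = random.choice(remote_mems)            # the non-local 30%
    else:
        mem = random.choice(local_mems + remote_mems)  # uniform over all
    return mem, req

# Example: a processor with two one-hop memories and four remote ones.
mem, req = pick_destination([1, 2], [3, 4, 5, 6], localized=True)
print(mem, req)
```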
Performance Evaluation
- Compared with the GLB and ANoC architectures, the NoC using DuCA reduces the average latency as the request rate increases, under both the uniform and non-uniform traffic models.
- The performance gain near the saturation point (0.6) is about 17%–35% under uniform traffic and 16%–38% under non-uniform traffic.
[Figure: average latency (cycles) vs. request rate (fraction of capacity) for GLB, DuCA, and ANoC under (a) uniform and (b) non-uniform traffic]
Physical Implementation
- DuCA, ANoC, and GLB are synthesized with Synopsys Design Compiler using a 65 nm standard CMOS technology.
- A timing constraint of 1 GHz for the system clock and a supply voltage of 1 V are used.
- The power dissipation of each scheme is calculated under the uniform traffic model near the saturation point (in a 6×6 2D mesh) using Synopsys PrimePower.
Router        Area (mm²)   Power (mW)
GLB Router    0.1873       59
ANoC Router   0.1913       66
DuCA Router   0.1954       69
Conclusion
- Network congestion can limit the performance of a NoC due to increased transmission latency.
- Most recent congestion-aware techniques aim to alleviate congestion in the network by delivering packets along less congested paths.
- Almost none of them take information about the backward paths into consideration.
- DuCA utilizes the congestion status of both forward and backward paths in the routing decision.
- DuCA provides a better view of the network congestion by considering traffic at the region level rather than the node level.
- Experimental results demonstrate that the DuCA method reduces the average latency compared with ANoC and GLB (by roughly 16%–38% near saturation) at a modest area and power overhead.