TCEP: Traffic Consolidation for Energy-Proportional High-Radix Networks - PowerPoint PPT Presentation
TCEP: Traffic Consolidation for Energy-Proportional High-Radix Networks
Gwangsun Kim (Arm Research), Hayoung Choi, John Kim (KAIST)
High-radix Networks
▪ A large number of narrow links → low network diameter and high path diversity
▪ Energy proportionality can be challenging
– Links use high-speed signaling
– High energy consumption regardless of load ('idle' packets are transmitted even when there is no traffic)
[Figure: Dragonfly network in the Cray XC30 system (image source: Cray)]
[Figure: 1D Flattened butterfly (fully connected)]
[Figure: 2D Flattened butterfly (fully connected within each dimension)]
Motivation
▪ Data center networks can be significantly underutilized
– Resources are provisioned to meet peak demand
– Low link utilization measured at Facebook [Roy et al., SIGCOMM'15]
▪ Network energy waste can be high at low system utilization [Abts et al., ISCA'10]
▪ Exploit the link power-gating opportunity in high-radix routers
Link power-gating challenges:
- How to maximize power reduction?
- How to keep network connected?
- How to achieve scalability?
- How to minimize performance impact?
- How to load-balance network?
Contents
▪ Background / Motivation
▪ Traffic consolidation
▪ Maintaining connectivity
▪ Criteria for selecting links to power-gate
▪ Power-aware load-balanced routing
▪ Evaluation
▪ Conclusion
Traffic Consolidation
▪ Energy proportionality requires aggressive power-gating
▪ Consolidate flows onto fewer links through non-minimal routing
[Figure: traffic consolidation example. Before: Flow 0, Flow 1, and Flow 2 each occupy a separate link at 50% utilization. After consolidation via non-minimal routing, the flows share fewer links, and the freed links can be power-gated.]
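As a toy illustration only (the paper consolidates traffic via non-minimal routing and distributed decisions, not an explicit packing step), consolidation can be viewed as packing flow demands onto as few links as possible; all names here are hypothetical:

```python
# Toy sketch: greedy first-fit packing of flow demands onto links, where each
# demand is a fraction of one link's capacity. Links left empty after packing
# are candidates for power-gating. Not the paper's algorithm.

def consolidate(demands, capacity=1.0):
    """Return per-link loads after greedily packing the given flow demands."""
    links = []
    for d in sorted(demands, reverse=True):   # largest flows first
        for i, load in enumerate(links):
            if load + d <= capacity:
                links[i] += d                 # fit onto an existing ON link
                break
        else:
            links.append(d)                   # otherwise keep one more link ON
    return links

# Three flows at 50% utilization fit on two links instead of three,
# so one of the three original links can be power-gated.
print(consolidate([0.5, 0.5, 0.5]))  # -> [1.0, 0.5]
```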
Subnetwork-based Distributed Approach
▪ Subnetwork: routers that are fully connected in a dimension
▪ Independently manage power with local information
[Figure: 2D Flattened butterfly with routers R0-R15; the top row (R0-R3) is highlighted as fully connected within its dimension.]
[Figure: row subnetworks highlighted in the 2D Flattened butterfly.]
[Figure: column subnetworks highlighted in the 2D Flattened butterfly.]
[Figure: a 1D Flattened butterfly consists of a single subnetwork.]
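The row/column decomposition above can be sketched in a few lines; the 4x4 layout and router numbering (R0-R15, row-major) follow the slide's example, but the helper function itself is illustrative:

```python
# Sketch: decompose a 2D Flattened butterfly into its subnetworks.
# Routers that share a row (or a column) are fully connected in that
# dimension, so each row and each column forms one subnetwork.

def subnetworks(rows, cols):
    """Return (row_subnetworks, column_subnetworks) as lists of router IDs,
    assuming row-major numbering."""
    ids = [[r * cols + c for c in range(cols)] for r in range(rows)]
    row_subnets = ids
    col_subnets = [[ids[r][c] for r in range(rows)] for c in range(cols)]
    return row_subnets, col_subnets

row_s, col_s = subnetworks(4, 4)
print(row_s[0])  # -> [0, 1, 2, 3]    routers R0-R3 form one row subnetwork
print(col_s[0])  # -> [0, 4, 8, 12]   R0, R4, R8, R12 form one column subnetwork
```

Each router belongs to exactly one subnetwork per dimension, which is what lets power be managed independently with only local information.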
Root Network – Maintaining Connectivity
▪ Constantly checking connectivity can incur high overhead
▪ Root network: a subset of links that are always ON to keep all nodes connected
▪ Star topology → minimal number of links and low network diameter
[Figure: root network for a 1D Flattened butterfly (routers R0-R7) with one router as the hub; legend distinguishes root network links from other links. The root network can be rearranged; maximum hop count over root links = 2.]
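The star-topology properties claimed above (minimal link count, hop count at most 2) can be checked with a small sketch; the helper names are illustrative:

```python
# Sketch: a star root network over n routers keeps the network connected with
# the minimum number of always-ON links (n-1), and any router reaches any
# other in at most 2 hops over root links.
from collections import deque

def star_links(n, root=0):
    """Always-ON links of a star root network with the given hub router."""
    return [(root, i) for i in range(n) if i != root]

def max_hops(n, links):
    """Network diameter over the given links via BFS; inf if disconnected."""
    adj = {i: set() for i in range(n)}
    for a, b in links:
        adj[a].add(b); adj[b].add(a)
    worst = 0
    for src in range(n):
        dist, q = {src: 0}, deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        if len(dist) < n:
            return float('inf')       # some router is unreachable
        worst = max(worst, max(dist.values()))
    return worst

links = star_links(8)        # root network for an 8-router subnetwork
print(len(links))            # -> 7 links stay ON (n-1)
print(max_hops(8, links))    # -> 2 (maximum hop count over root links)
```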
Root Network for Higher Dimensions
▪ Star topology is formed within each subnetwork
▪ Further reducing ON links is too complex for little added benefit (4.3% for radix-64 routers)
▪ Other links can be power-gated without affecting connectivity
[Figure: root network for a 2D Flattened butterfly; a star is formed within each row and column subnetwork. Legend: root network links vs. other links.]
Observation on Maximizing Path Diversity
▪ Which links should be ON for high path diversity?
[Figure: Approach 1 distributes the ON links; Approach 2 concentrates the ON links (legend: ON root-network links, ON additional links, OFF links). Concentrating the ON links provides better path diversity.]
[Plot: normalized number of total paths vs. fraction of active links (%), comparing "concentrate to hub routers" against "randomly distribute".]
Hub Routers for High Path Diversity
▪ Concentrating ON links creates ‘hub’ routers → a ‘small-world’ network
▪ Similarly, airlines create hub airports to reduce cost
▪ Quantitative results:
– 1D Flattened butterfly (32 routers, 1024 nodes)
– No non-minimal paths with more than 2 hops
– Random distribution: average over 10,000 samples
– Total number of paths improved by 1.9x
[Figure: United Airlines route map; edges are direct flights, concentrated at hub airports.]
TCEP concentrates ON links to a small number of “hub” routers.
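The effect can be reproduced on a toy scale. The sketch below (illustrative, not the paper's 32-router experiment) counts paths of at most 2 hops over the ON links, comparing a budget of extra ON links concentrated at one hub router against the same budget scattered across routers; the specific link choices are assumptions:

```python
# Sketch: count <=2-hop paths over ON links for two placements of extra
# ON links on top of a star root network (hub router R0).
from itertools import combinations

def two_hop_paths(n, on_links):
    """Total number of distinct paths of length <= 2 over unordered pairs."""
    adj = {i: set() for i in range(n)}
    for a, b in on_links:
        adj[a].add(b); adj[b].add(a)
    total = 0
    for s, d in combinations(range(n), 2):
        total += 1 if d in adj[s] else 0                           # direct
        total += sum(1 for m in adj[s] if m != d and d in adj[m])  # via m
    return total

n = 8
root = [(0, i) for i in range(1, n)]                   # star root network
hub = root + [(1, i) for i in range(2, n)]             # concentrate at R1
spread = root + [(2, 3), (4, 5), (6, 7), (2, 5), (3, 6), (4, 7)]  # scatter

print(two_hop_paths(n, hub), two_hop_paths(n, spread))  # -> 61 52
```

With the same number of ON links, concentrating them at a second hub creates many more 2-hop detour options, mirroring the slide's 1.9x observation at larger scale.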
Observation on Minimizing Impact on Network
Differentiate the type of traffic (minimally vs. non-minimally routed)
[Figure: power-gating decision at R0 in a 4-router example (R0-R3), comparing two candidate links. Power-gating Candidate 1, which carries minimally routed traffic, increases hop count and bandwidth usage; power-gating Candidate 2, which carries non-minimally routed traffic, keeps the same hop count and bandwidth usage. Legend: ON and OFF links.]
TCEP prioritizes power-gating links with the least amount of minimally routed traffic.
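The selection criterion reduces to a comparison of per-link traffic counters. A minimal sketch, assuming hypothetical link names and per-epoch flit counters:

```python
# Illustrative sketch of the criterion: among candidate links, prefer to
# power-gate the one carrying the least minimally routed traffic, since
# non-minimal flows can be re-routed without increasing hop count or
# bandwidth usage.

def pick_power_gate_candidate(links):
    """links: dict mapping link id -> (minimal_flits, nonminimal_flits)
    observed over the last epoch. Returns the best link to power-gate."""
    return min(links, key=lambda link: links[link][0])

counters = {
    "R0->R1": (120, 30),   # mostly minimal traffic: keep ON
    "R0->R2": (5, 200),    # mostly non-minimal traffic: good candidate
    "R0->R3": (40, 10),
}
print(pick_power_gate_candidate(counters))  # -> R0->R2
```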
Problem with Load-balanced Routing
▪ Baseline non-minimal routing: Source → Intermediate (INTM) router → Destination
▪ Without global link-state information, a non-minimal path can become significantly longer
[Figure: baseline (no power-gating): the non-minimal route SRC → INTM → DEST takes 4 hops while avoiding congestion. With power-gating, some links are OFF, and the same route stretches to 8 hops.]
Proposed PAL Routing
▪ PAL: Power-Aware progressive Load-balanced routing
▪ Instead of global randomization, randomize locally
▪ Compared to the baseline load-balanced routing:
– The same maximum hop count (2 hops in each dimension)
– No additional virtual channels (dimension-order routing)
[Figure: PAL routing example: SRC → INTM_X → DEST_X → INTM_Y → DEST completes in 4 hops while avoiding congestion and OFF links.]
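A hedged sketch of the per-dimension idea (details differ from the paper's full PAL algorithm; the function and the ON-link predicate are assumptions): within one dimension's fully connected subnetwork, the router randomizes only over intermediates reachable via ON links, so the detour never exceeds 2 hops per dimension.

```python
# Sketch: locally randomized routing inside one subnetwork. The router picks
# a random intermediate only among routers whose links to both endpoints are
# ON; otherwise it routes minimally if the direct link is ON.
import random

def route_in_dimension(src, dst, on_link, routers, rng=random):
    """Return a path (list of routers) from src to dst inside one fully
    connected subnetwork, using at most 2 hops over ON links.
    on_link(a, b) -> bool reports whether link a->b is powered on."""
    if src == dst:
        return [src]
    candidates = [m for m in routers
                  if m not in (src, dst) and on_link(src, m) and on_link(m, dst)]
    # Take a non-minimal detour when forced (direct link OFF) or, for load
    # balancing, with 50% probability when both options exist.
    if candidates and (not on_link(src, dst) or rng.random() < 0.5):
        m = rng.choice(candidates)   # local (per-subnetwork) randomization
        return [src, m, dst]
    if on_link(src, dst):
        return [src, dst]            # minimal route
    raise RuntimeError("no ON path in this subnetwork")

# Example: R0-R3 fully connected, but link R0-R3 is OFF in both directions.
on = {(0, 1), (1, 0), (1, 3), (3, 1), (0, 2), (2, 0), (2, 3), (3, 2)}
path = route_in_dimension(0, 3, lambda a, b: (a, b) in on, range(4))
print(path)  # a 2-hop detour over ON links, e.g. [0, 1, 3] or [0, 2, 3]
```

Applying this independently in the X and then the Y dimension (dimension-order routing) bounds the total at 2 hops per dimension, matching the baseline's maximum hop count without extra virtual channels.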
Other Issues Addressed in the Paper
▪ Challenge: the observations can lead to conflicting decisions → we propose a low-complexity algorithm to reconcile them
▪ Only one link is turned on/off per router per epoch → avoids supply-voltage shift
▪ Routers keep track of non-minimal paths within a subnetwork
▪ More details in the paper.
Methodology
▪ Booksim: cycle-accurate interconnection network simulator
▪ SST/Macro + Booksim for real-workload evaluation
▪ Prior work compared: SLaC (Staged Laser Control) [HPCA'16]
– A "stage" corresponds to a row of routers in the 2D Flattened butterfly
– Turns stages on/off in a coarse-grained manner
▪ Network configuration:
– Topology: 512-node 2D Flattened butterfly
– Virtual channels: 6 VCs for baseline, 7 VCs for TCEP and SLaC
– Link wake-up delay: 1 µs (= epoch length; sensitivity study in the paper)
Synthetic Traffic Results
▪ Performance
[Plots: average packet latency (cycles) vs. injection rate (flits/cycle/node) for Baseline, SLaC, and TCEP under uniform random and bit reverse traffic patterns.]
▪ Energy
[Plots: normalized energy per flit vs. injection rate for Baseline, SLaC, and TCEP under uniform random and bit reverse traffic patterns; up to 7x throughput difference and up to 73% lower energy.]
Multi-workload Scenario Results
▪ Two different batch workloads running simultaneously
– High and low injection rates
– 100 random node mappings
[Plots: energy ratio (SLaC/TCEP) across the 100 mappings. Uniform random traffic: 12% lower energy with TCEP; random permutation traffic: 73% lower energy with TCEP. Runtime was similar (within 0.3%); TCEP was 1.9-3.6x faster than SLaC.]
Real Workload Results
▪ Packet latency: SLaC significantly increased latency for some workloads
[Plot: normalized average packet latency for Baseline, SLaC, and TCEP across HILO, BoxMG, MG, NB, FB, BigFFT, and GMEAN; SLaC bar labels include 1.61 and 1.15.]
▪ Network energy: significant energy savings by both SLaC and TCEP
[Plot: normalized energy for Baseline, SLaC, and TCEP across the same workloads; bar labels include 4.51, 0.255, and 0.254.]