SLIDE 1

TCEP: Traffic Consolidation for Energy-Proportional High-Radix Networks

Gwangsun Kim (Arm Research); Hayoung Choi, John Kim (KAIST)

SLIDE 2

High-radix Networks

▪ A large number of narrow links → low network diameter, high path diversity
▪ Energy proportionality can be challenging
  – Links use high-speed signaling
  – High energy consumption regardless of load ("idle" packets are transmitted)

[Figures: Dragonfly network in the Cray XC30 system (image source: Cray); 1D flattened butterfly (fully connected); 2D flattened butterfly (fully connected within each dimension)]


SLIDE 4

Motivation

▪ Data center networks can be significantly underutilized
  – Resources are provisioned to meet peak demand
  – Facebook measured low link utilization [Roy et al., SIGCOMM'15]
▪ Network energy waste can be high at low system utilization [Abts et al., ISCA'10]
▪ Exploit the link power-gating opportunity in high-radix routers

Link power-gating challenges:

  • How to maximize power reduction?
  • How to keep the network connected?
  • How to achieve scalability?
  • How to minimize performance impact?
  • How to load-balance the network?
SLIDE 5

Contents

▪ Background / Motivation
▪ Traffic consolidation
▪ Maintaining connectivity
▪ Criteria for selecting links to power-gate
▪ Power-aware load-balanced routing
▪ Evaluation
▪ Conclusion

SLIDE 6

Traffic Consolidation

▪ Energy proportionality requires aggressive power-gating
▪ Consolidate flows onto fewer links through non-minimal routing

[Figure: before consolidation, Flows 0–2 each occupy a separate link at 50% utilization; after consolidation, Flow 1 is split non-minimally (25% + 25%) over the links carrying Flows 0 and 2, freeing a link for power-gating]
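The consolidation idea above can be sketched as a greedy first-fit packing of flow demand onto links. This is only an illustrative model, not the paper's algorithm; the function name and the splitting policy are our own assumptions.

```python
def consolidate(flows, num_links, capacity=1.0):
    """Greedy first-fit: pack fractional flow demands onto the fewest links,
    splitting a flow across links (non-minimal paths) when one link is full."""
    loads = [0.0] * num_links
    for demand in flows:
        remaining = demand
        for i in range(num_links):
            if remaining <= 1e-12:
                break
            room = capacity - loads[i]
            if room > 0:
                placed = min(room, remaining)
                loads[i] += placed
                remaining -= placed
        assert remaining <= 1e-12, "total demand exceeds total capacity"
    return loads

# Three flows at 50% utilization fit on two links; the third link becomes
# idle and is a power-gating candidate.
loads = consolidate([0.5, 0.5, 0.5], num_links=3)
active = sum(1 for u in loads if u > 0)
```

With this policy the emptied links, rather than lightly loaded ones, are the power-gating candidates.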

SLIDE 7

Subnetwork-based Distributed Approach

▪ Subnetwork: routers that are fully connected in a dimension
▪ Independently manage power with local information

[Figure: 2D flattened butterfly with routers R0–R15, fully connected within each dimension; R0–R3 highlighted as one subnetwork]

SLIDE 8

Subnetwork-based Distributed Approach

▪ Subnetwork: routers that are fully connected in a dimension
▪ Independently manage power with local information

[Figure: row subnetworks highlighted in the 2D flattened butterfly]

SLIDE 9

Subnetwork-based Distributed Approach

▪ Subnetwork: routers that are fully connected in a dimension
▪ Independently manage power with local information

[Figure: column subnetworks highlighted in the 2D flattened butterfly]

SLIDE 10

Subnetwork-based Distributed Approach

▪ Subnetwork: routers that are fully connected in a dimension
▪ Independently manage power with local information

[Figure: a 1D flattened butterfly consists of a single subnetwork]

SLIDE 11

Root Network – Maintaining Connectivity

▪ Constantly checking connectivity can incur high overhead
▪ Root network: a subset of links that are always ON to keep all nodes connected
▪ Star topology → minimal number of links and low network diameter

[Figure: root network (star centered at R0) for a 1D flattened butterfly with R0–R7; the root network can be rearranged; maximum hop count = 2]
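A minimal sketch of the star root network and a BFS check of the 2-hop claim; the helper names are illustrative, not from the paper.

```python
from collections import deque

def star_root_network(n, hub=0):
    """Always-on link set of a star centered at `hub` (n-1 links)."""
    return {frozenset((hub, i)) for i in range(n) if i != hub}

def max_hops(n, links):
    """Graph diameter over `links` via BFS; inf if disconnected."""
    adj = {i: [] for i in range(n)}
    for link in links:
        a, b = tuple(link)
        adj[a].append(b)
        adj[b].append(a)
    worst = 0
    for src in range(n):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        if len(dist) < n:
            return float("inf")    # some router is unreachable
        worst = max(worst, max(dist.values()))
    return worst

links = star_root_network(8)       # R0..R7, as on the slide
diameter = max_hops(8, links)      # every pair reachable within 2 hops
```

The star uses the minimum number of links (n-1) that can keep n routers connected, which is why it is the natural choice for the always-on subset.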


SLIDE 14

Root Network for Higher Dimensions

▪ Star topology is formed within each subnetwork
▪ Further reducing ON links? → too complex, little added benefit (4.3% for radix-64 routers)
▪ Other links can be power-gated without affecting connectivity

[Figure: root network for the 2D flattened butterfly (star links ON within each subnetwork; other links can be OFF)]

SLIDE 15

Observation on Maximizing Path Diversity

▪ Which links should be ON for high path diversity?

[Figure: Approach 1 distributes additional ON links across routers; Approach 2 concentrates them on a few routers (legend: root-network links ON, additional links ON, remaining links OFF)]

→ Approach 2 provides better path diversity


SLIDE 17

Hub Routers for High Path Diversity

▪ "Hub" routers are created → "small-world" network
▪ Similarly, airlines create hub airports to reduce cost
▪ Quantitative results:
  – 1D flattened butterfly (32 routers, 1,024 nodes)
  – No non-minimal paths with more than 2 hops
  – Random distribution: average over 10,000 samples
  – Total number of paths improved by 1.9x

[Chart: normalized number of total paths vs. fraction of active links (%), comparing concentrating ON links on "hub" routers against randomly distributing them]
[Figure: map of direct United Airlines flights, with hub airports]

TCEP concentrates ON links to a small number of “hub” routers.
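The path-diversity observation can be checked with a toy model. This uses a smaller network than the slide's 32-router setup and a single hub, so the counts are illustrative only, not the paper's 1.9x result.

```python
import random
from itertools import combinations

def total_paths(n, links):
    """Count direct paths plus distinct 2-hop paths between all router pairs."""
    on = set(links)
    count = 0
    for s, d in combinations(range(n), 2):
        if frozenset((s, d)) in on:
            count += 1    # minimal (direct) path
        for i in range(n):
            if i not in (s, d) and frozenset((s, i)) in on and frozenset((i, d)) in on:
                count += 1    # 2-hop non-minimal path
    return count

n, budget = 16, 15
hub_links = [frozenset((0, i)) for i in range(1, n)]             # concentrate on hub R0
all_links = [frozenset(p) for p in combinations(range(n), 2)]
random.seed(1)
rand_links = random.sample(all_links, budget)                    # distribute randomly

hub_paths = total_paths(n, hub_links)    # 15 direct + C(15,2) 2-hop paths via R0
rand_paths = total_paths(n, rand_links)
```

With the same link budget, concentrating links on a hub gives every non-hub pair a 2-hop path through the hub, while scattering them leaves most pairs with few or no paths.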



SLIDE 20

Observation on Minimizing Impact on Network

Differentiate the type of traffic (minimally vs. non-minimally routed)

[Figure: power-gating decision at R0 with two candidate links; gating Candidate 1 reroutes minimally routed traffic (increased hop count and bandwidth usage), while gating Candidate 2 reroutes non-minimally routed traffic (same hop count and bandwidth usage)]

TCEP prioritizes power-gating links with the least amount of minimally routed traffic.

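The criterion on this slide can be sketched as a simple selection over per-link counters. The counter of minimally routed flits is hypothetical bookkeeping for illustration, not the paper's exact mechanism.

```python
def pick_power_gate_candidate(candidates, minimal_flits):
    """Prefer the candidate link that carried the fewest minimally routed
    flits: its traffic can be re-consolidated at the same hop count, while
    rerouted minimal traffic would pay extra hops and bandwidth."""
    return min(candidates, key=lambda link: minimal_flits.get(link, 0))

# Hypothetical per-epoch counters of minimally routed flits per outgoing link.
stats = {"R0->R1": 120, "R0->R2": 5, "R0->R3": 40}
victim = pick_power_gate_candidate(["R0->R1", "R0->R2", "R0->R3"], stats)
```

Here the link carrying almost no minimal traffic is gated first, matching the slide's takeaway.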


SLIDE 22

Problem with Load-balanced Routing

▪ No global link state information → a non-minimal path can become significantly longer
▪ Baseline non-minimal routing: source → intermediate (INTM) router → destination

[Figure: in the baseline (no power-gating), the non-minimal path SRC → INTM → DEST avoids congestion in 4 hops; with power-gating, some links are OFF and the same routing yields an 8-hop path]


SLIDE 24

Proposed PAL Routing

▪ PAL: Power-Aware progressive Load-balanced routing
▪ Instead of globally randomizing the intermediate router, randomize locally
▪ Compared to the baseline load-balanced routing:
  – The same maximum hop count (2 hops in each dimension)
  – No additional virtual channels (dimension-order routing)

[Figure: PAL route in the 2D flattened butterfly: SRC → INTM_X → DEST_X within the row, then → INTM_Y → DEST within the column, avoiding congestion in 4 hops]
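A simplified sketch of the local-randomization idea within one fully connected subnetwork. The real PAL routing operates progressively across dimensions; this 1D version (with illustrative names) only shows the ≤ 2-hops-per-dimension property.

```python
import random

def pal_route_1d(src, dst, n, on_links, rng=random):
    """Route within one n-router fully connected subnetwork using only ON links."""
    if frozenset((src, dst)) in on_links:
        return [src, dst]                      # minimal route: 1 hop
    # Locally randomize over intermediates reachable via two ON links,
    # so the detour never exceeds 2 hops in this dimension.
    intms = [i for i in range(n)
             if i not in (src, dst)
             and frozenset((src, i)) in on_links
             and frozenset((i, dst)) in on_links]
    if not intms:
        return None                            # fall back to the root network (not shown)
    return [src, rng.choice(intms), dst]       # non-minimal route: 2 hops

# Star root network centered at R0, plus one extra ON link R2-R3 (8 routers).
on = {frozenset((0, i)) for i in range(1, 8)} | {frozenset((2, 3))}
path = pal_route_1d(2, 5, 8, on)               # only R0 qualifies as intermediate
```

Because candidates are filtered by locally known ON links, the 8-hop blowup from the previous slide cannot occur within a dimension.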

SLIDE 25

Other Issues Addressed in the Paper

▪ Challenge: the observations can lead to different decisions → we propose a low-complexity algorithm to reconcile them
▪ Only one link is turned on/off per router per epoch → avoids abrupt supply-voltage shift
▪ Routers keep track of non-minimal paths within a subnetwork
▪ More details in the paper
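The one-change-per-epoch rule can be sketched as a small per-router controller; the class and method names are illustrative assumptions, not the paper's interface.

```python
class EpochGateController:
    """Applies at most one link power-state change per epoch to avoid
    abrupt supply-voltage shifts; remaining requests wait for later epochs."""

    def __init__(self, links_on):
        self.links_on = set(links_on)

    def end_of_epoch(self, turn_on=(), turn_off=()):
        """Apply the first applicable pending change, if any, and report it."""
        for link in turn_on:
            if link not in self.links_on:
                self.links_on.add(link)
                return ("on", link)
        for link in turn_off:
            if link in self.links_on:
                self.links_on.discard(link)
                return ("off", link)
        return None

ctrl = EpochGateController({"L0", "L1", "L2"})
first = ctrl.end_of_epoch(turn_off=("L1", "L2"))   # only one change applied
```

Even if consolidation asks for several links to be gated at once, they are turned off one epoch at a time.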

SLIDE 26

Methodology

▪ Booksim: cycle-accurate interconnection network simulator
▪ SST/Macro + Booksim for real-workload evaluation
▪ Prior work compared against: SLaC (Staged Laser Control) [HPCA'16]
  – A "stage" corresponds to a row of routers in the 2D flattened butterfly
  – Turns an entire stage on/off in a coarse-grained manner

▪ Network configuration:

  Parameter            Value
  Topology             512-node 2D flattened butterfly
  Virtual channels     6 VCs for baseline; 7 VCs for TCEP and SLaC
  Link wake-up delay   1 µs (= epoch length; sensitivity study in the paper)

SLIDE 27

Synthetic Traffic Results

▪ Performance: average packet latency (cycles) vs. injection rate (flits/cycle/node) for Baseline, SLaC, and TCEP
▪ Energy: normalized energy per flit vs. injection rate for the same three schemes

[Charts: latency and energy curves under the uniform random and bit reverse traffic patterns]

▪ Up to 7x throughput difference
▪ Up to 73% lower energy

SLIDE 28

Multi-workload Scenario Results

▪ Two different batch workloads running simultaneously
  – High and low injection rates
  – 100 random node mappings

[Charts: energy ratio (SLaC/TCEP) across the 100 mappings]

▪ Uniform random traffic: 12% lower energy with TCEP
▪ Random permutation traffic: 73% lower energy with TCEP
▪ Runtime was similar (within 0.3%)
▪ TCEP was 1.9-3.6x faster than SLaC

SLIDE 29

Real Workload Results

▪ Packet latency: SLaC significantly increased latency for some workloads
▪ Network energy: significant energy savings by both SLaC and TCEP
▪ Overall, TCEP's energy was 46% lower than SLaC's

[Charts: normalized average packet latency and normalized energy for Baseline, SLaC, and TCEP across HILO, BoxMG, MG, NB, FB, BigFFT, and GMEAN]

SLIDE 30

Conclusion

▪ TCEP consolidates traffic to proactively power-gate links
▪ Key observations:
  – Concentrate ON links on "hub" routers for high path diversity
  – Differentiate minimally and non-minimally routed traffic
▪ PAL (Power-Aware progressive Load-balanced) routing
  – Load-balanced routing without global link state information
  – Keeps the maximum hop count the same as the baseline without power-gating
▪ Results (compared to SLaC):
  – Up to 7x higher throughput for adversarial traffic patterns
  – Up to ~70% lower energy and runtime for multi-workload scenarios