SLIDE 1

Friends, not Foes – Synthesizing Existing Transport Strategies for Data Center Networks

Ali Munir
Michigan State University

Ghufran Baig, Syed M. Irteza, Ihsan A. Qazi, Alex X. Liu, Fahad R. Dogar

SLIDE 2

Data Center (DC) Applications

  • Distributed applications
    – Components interact via the network; e.g., a Bing search query touches > 100 machines
    – Examples: Search, Mail, MapReduce, HPC, Monitoring

  • Network impacts performance
    – “10% of search responses observe 1 to 14 ms of network queuing delay” [DCTCP, SIGCOMM ’10]

Image source: http://cdn.slashgear.com/wp-content/uploads/2012/10/google-datacenter-tech-13.jpg

SLIDE 3

DC Network Resource Allocation

  • Fair Sharing
    – Equal bandwidth sharing among jobs [TCP, DCTCP]
    – Increases completion time for everyone
    – Traditional “fairness” metrics less relevant

  • QoS Aware
    – Prioritize some jobs over other jobs (priority scheduling)
    – Minimize flow completion times [pFabric, L2DCT]
    – Meet flow deadlines [D3, D2TCP]

SLIDE 4

DC Transports

  • DCTCP [SIGCOMM ’10], D2TCP [SIGCOMM ’12], L2DCT [INFOCOM ’13]
  • D3 [SIGCOMM ’11], PDQ [SIGCOMM ’12], pFabric [SIGCOMM ’13]

SLIDE 5

DC Transports

  • DCTCP [SIGCOMM ’10], D2TCP [SIGCOMM ’12], L2DCT [INFOCOM ’13]
  • D3 [SIGCOMM ’11], PDQ [SIGCOMM ’12], pFabric [SIGCOMM ’13]
    – Near Optimal but not Deployment Friendly (changes in the data plane)

SLIDE 6

DC Transports

  • DCTCP [SIGCOMM ’10], D2TCP [SIGCOMM ’12], L2DCT [INFOCOM ’13]
    – Deployment Friendly but Suboptimal
  • D3 [SIGCOMM ’11], PDQ [SIGCOMM ’12], pFabric [SIGCOMM ’13]
    – Near Optimal but not Deployment Friendly (changes in the data plane)

SLIDE 7

DC Transports

  • DCTCP [SIGCOMM ’10], D2TCP [SIGCOMM ’12], L2DCT [INFOCOM ’13]
    – Deployment Friendly but Suboptimal
  • D3 [SIGCOMM ’11], PDQ [SIGCOMM ’12], pFabric [SIGCOMM ’13]
    – Near Optimal but not Deployment Friendly (changes in the data plane)

Step back and ask:

How can we design a deployment-friendly and near-optimal data center transport while leveraging the insights offered by existing proposals?

SLIDE 8

DC Transports

  • DCTCP [SIGCOMM ’10], D2TCP [SIGCOMM ’12], L2DCT [INFOCOM ’13]
    – Deployment Friendly but Suboptimal
  • D3 [SIGCOMM ’11], PDQ [SIGCOMM ’12], pFabric [SIGCOMM ’13]
    – Near Optimal but not Deployment Friendly (changes in the data plane)

Step back and ask:

How can we design a deployment-friendly and near-optimal data center transport while leveraging the insights offered by existing proposals?

PASE

SLIDE 9

Rest of the Talk …

  • DC Transport Strategies
  • PASE Design
  • Evaluation

SLIDE 10

Rest of the Talk …

  • DC Transport Strategies
  • PASE Design
  • Evaluation

SLIDE 11

DC Transport Strategies

  • Self-adjusting endpoints
    – senders make independent decisions and adjust rates by themselves
    – e.g., TCP, DCTCP, L2DCT

  • Arbitration
    – a common network entity (e.g., a switch) allocates rates to each flow
    – e.g., D3, PDQ

  • In-network prioritization
    – switches schedule and drop packets based on packet priority
    – e.g., pFabric

SLIDE 12

DC Transport Strategies

  • Self-adjusting endpoints
    – senders make independent decisions and adjust rates by themselves
    – e.g., TCP, DCTCP, L2DCT

  • Arbitration
    – a common network entity (e.g., a switch) allocates rates to each flow
    – e.g., D3, PDQ

  • In-network prioritization
    – switches schedule and drop packets based on packet priority
    – e.g., pFabric

Existing DC transport proposals use only one of these strategies.
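
To make the “self-adjusting endpoints” strategy concrete, here is a minimal sketch of a DCTCP-style control law, in which a sender cuts its window in proportion to the fraction of ECN-marked packets. The class name, constants, and the simplified per-window update are illustrative assumptions, not code from any of the cited protocols.

```python
# Minimal sketch of a DCTCP-style self-adjusting endpoint (illustrative).
# The sender reacts to the fraction of ECN-marked ACKs, not just to loss.

class DctcpLikeSender:
    def __init__(self, cwnd: float = 10.0, g: float = 1.0 / 16):
        self.cwnd = cwnd   # congestion window, in packets
        self.alpha = 0.0   # EWMA of the fraction of marked packets
        self.g = g         # EWMA gain

    def on_ack_window(self, acked: int, marked: int) -> None:
        """Update state once per window of ACKs."""
        frac = marked / acked if acked else 0.0
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked:
            # Back off in proportion to the extent of congestion.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            # Additive increase while no congestion is signaled.
            self.cwnd += 1.0
```

Each sender runs this loop independently; no entity has a global view of the link, which is why strict priority scheduling across flows is hard to achieve with this strategy alone.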

SLIDE 13

Transport Strategies in Isolation

Transport Strategy        | Example             | Pros | Cons
--------------------------|---------------------|------|-----
Self-Adjusting Endpoints  | DCTCP, D2TCP, L2DCT |      |
Arbitration               | PDQ, D3             |      |
In-network Prioritization | pFabric             |      |

SLIDE 14

Transport Strategies in Isolation

Transport Strategy        | Example             | Pros               | Cons
--------------------------|---------------------|--------------------|------------------------------
Self-Adjusting Endpoints  | DCTCP, D2TCP, L2DCT | Ease of deployment | No strict priority scheduling
Arbitration               | PDQ, D3             |                    |
In-network Prioritization | pFabric             |                    |

SLIDE 15

Transport Strategies in Isolation

Transport Strategy        | Example             | Pros                       | Cons
--------------------------|---------------------|----------------------------|------------------------------------------------------------
Self-Adjusting Endpoints  | DCTCP, D2TCP, L2DCT | Ease of deployment         | No strict priority scheduling
Arbitration               | PDQ, D3             | Strict priority scheduling | High flow switching overhead; hard to compute precise rates
In-network Prioritization | pFabric             |                            |

SLIDE 16

Transport Strategies in Isolation

Transport Strategy        | Example             | Pros                        | Cons
--------------------------|---------------------|-----------------------------|------------------------------------------------------------
Self-Adjusting Endpoints  | DCTCP, D2TCP, L2DCT | Ease of deployment          | No strict priority scheduling
Arbitration               | PDQ, D3             | Strict priority scheduling  | High flow switching overhead; hard to compute precise rates
In-network Prioritization | pFabric             | Low flow switching overhead | Switch-local decisions; limited # of priority queues

SLIDE 17

Transport Strategies in Unison

Transport Strategy        | Example             | Pros                        | Cons
--------------------------|---------------------|-----------------------------|------------------------------------------------------------
Self-Adjusting Endpoints  | DCTCP, D2TCP, L2DCT | Ease of deployment          | No strict priority scheduling
Arbitration               | PDQ, D3             | Strict priority scheduling  | High flow switching overhead; hard to compute precise rates
In-network Prioritization | pFabric             | Low flow switching overhead | Switch-local decisions; limited # of priority queues

SLIDE 18

Transport Strategies in Unison

In-network Prioritization Alone

  • Limited # of queues, more # of flows (priorities)

[Figure: flows 1–4 mapped onto a high-priority and a low-priority queue]

SLIDE 19

Transport Strategies in Unison

In-network Prioritization Alone

  • Limited # of queues, more # of flows (priorities)
  • Flow multiplexing means limited performance gains!

Any static mapping mechanism degrades performance!

[Figure: flows 1–4 multiplexed onto a high-priority and a low-priority queue]

SLIDE 20

Transport Strategies in Unison

In-network Prioritization + Arbitration

  • Arbitrator performs dynamic mapping of flows to queues

Idea: as a flow’s turn comes, map it to the highest priority queue!
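
A minimal sketch of this idea follows: the arbitrator ranks flows and maps the flow whose turn has come to the highest-priority queue. The ranking metric (remaining flow size), the queue count, and the rule that waiting flows share the lowest queue are assumptions made for illustration.

```python
# Illustrative dynamic flow-to-queue mapping by an arbitrator.
# Queue 0 is the highest priority; the last queue is shared by all
# flows whose turn has not come yet.

NUM_QUEUES = 4  # assumption: switches expose 4 priority queues

def assign_queues(flows):
    """flows: dict {flow_id: remaining_bytes} -> dict {flow_id: queue}."""
    ranked = sorted(flows, key=flows.get)  # most critical flow first
    return {flow_id: min(rank, NUM_QUEUES - 1)
            for rank, flow_id in enumerate(ranked)}

# At time t1 flow 1 is shortest, so it gets queue 0; once it finishes,
# flow 2 is promoted to queue 0 at time t2, as in the slide's animation.
print(assign_queues({1: 10_000, 2: 50_000, 3: 200_000, 4: 900_000}))
print(assign_queues({2: 50_000, 3: 200_000, 4: 900_000}))
```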

SLIDE 21

Transport Strategies in Unison

In-network Prioritization + Arbitration

  • Arbitrator performs dynamic mapping of flows to queues

Idea: as a flow’s turn comes, map it to the highest priority queue!

[Figure: at time t1 the arbitrator maps flow 1 to the high-priority queue; flows 2–4 wait in the low-priority queue]

SLIDE 22

Transport Strategies in Unison

In-network Prioritization + Arbitration

  • Arbitrator performs dynamic mapping of flows to queues

Idea: as a flow’s turn comes, map it to the highest priority queue!

[Figure: at time t1 flow 1 occupies the high-priority queue; at time t2 flow 2 is promoted to it, with flows 3–4 still in the low-priority queue]

SLIDE 23

Transport Strategies in Unison

In-network Prioritization + Arbitration

  • Arbitrator performs dynamic mapping of flows to queues

Idea: as a flow’s turn comes, map it to the highest priority queue!

[Figure: at time t1 flow 1 occupies the high-priority queue; at time t2 flow 2 is promoted to it]

Similar benefits arise from combining:

  • Arbitration + Self-Adjusting Endpoints
  • Arbitration + In-network Prioritization

PASE leverages these insights in its design!

SLIDE 24

Rest of the Talk …

  • DC Transport Strategies
  • PASE Design
  • Evaluation

SLIDE 25

PASE Design Principle

Each transport strategy should focus on what it is best at doing!

  • Arbitrators

– Do inter-flow prioritization at coarse time-scales

  • Endpoints

– Probe for any spare link capacity

  • In-network prioritization

– Do per-packet prioritization at sub-RTT timescales

SLIDE 26

PASE Overview

[Figure: Sender, Receiver, and Arbitrator]

SLIDE 27

PASE Overview

[Figure: Sender, Receiver, and Arbitrator]

  • Arbitration: Control plane
    – Calculate “reference rate” and “priority queue”

SLIDE 28

PASE Overview

[Figure: the Arbitrator sends feedback to the Sender]

  • Arbitration: Control plane
    – Calculate “reference rate” and “priority queue”

  • Self-Adjusting Endpoints: Guided rate control
    – Use arbitrator feedback as a pivot

SLIDE 29

PASE Overview

[Figure: the Arbitrator sends feedback to the Sender]

  • Arbitration: Control plane
    – Calculate “reference rate” and “priority queue”

  • Self-Adjusting Endpoints: Guided rate control
    – Use arbitrator feedback as a pivot

  • In-network Prioritization: Existing priority queues

SLIDE 30

PASE Overview

[Figure: the Arbitrator sends feedback to the Sender]

Key Components:

  • Arbitration: Control plane
    – Calculate “reference rate” and “priority queue”

  • Self-Adjusting Endpoints: Guided rate control
    – Use arbitrator feedback as a pivot

  • In-network Prioritization: Existing priority queues

SLIDE 31

PASE Arbitration

[Figure: Sender, Receiver, and Arbitrator]

SLIDE 32

PASE Arbitration

[Figure: Sender, Receiver, and a chain of per-link Arbitrators]

Distributed Arbitration

  • per-link arbitration is done in the control plane
  • existing protocols implement it in the data plane

SLIDE 33

PASE Arbitration

[Figure: Sender, Receiver, and a chain of per-link Arbitrators]

Distributed Arbitration

  • per-link arbitration is done in the control plane
  • existing protocols implement it in the data plane

Arbitrator Location

  • at the end hosts (e.g., for their own links to the switch), OR
  • on dedicated hosts inside the DC

SLIDE 34

PASE Arbitration

[Figure: each Arbitrator along the path sends feedback toward the Sender]

Distributed Arbitration

  • per-link arbitration is done in the control plane
  • existing protocols implement it in the data plane

Arbitrator Location

  • at the end hosts (e.g., for their own links to the switch), OR
  • on dedicated hosts inside the DC

SLIDE 35

PASE Arbitration

[Figure: Arbitrators send feedback; the Sender then sends data with the minimum priority assigned along the path]

Distributed Arbitration

  • per-link arbitration is done in the control plane
  • existing protocols implement it in the data plane

Arbitrator Location

  • at the end hosts (e.g., for their own links to the switch), OR
  • on dedicated hosts inside the DC
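
As a rough illustration of what a per-link arbitrator could compute, the sketch below ranks the flows crossing one link and hands each a reference rate and a priority queue. The ranking metric (remaining size), the base rate for waiting flows, and the constants are simplifying assumptions, not the paper’s exact algorithm.

```python
# Illustrative per-link arbitrator (simplified, not PASE's exact algorithm).
# The most critical flow gets the link capacity as its reference rate and
# the top queue; the rest get a low base rate and lower-priority queues.

NUM_QUEUES = 4        # assumption: priority queues per switch port
LINK_CAPACITY = 1e9   # assumption: 1 Gbps link, in bits per second
BASE_RATE = 1e6       # assumption: background rate for waiting flows

def arbitrate(flows):
    """flows: {flow_id: remaining_bytes} -> {flow_id: (ref_rate, queue)}."""
    ranked = sorted(flows, key=flows.get)  # shortest remaining size first
    feedback = {}
    for rank, flow_id in enumerate(ranked):
        queue = min(rank, NUM_QUEUES - 1)
        rate = LINK_CAPACITY if rank == 0 else BASE_RATE
        feedback[flow_id] = (rate, queue)
    return feedback
```

Across a multi-hop path, the sender would then use the most conservative feedback it receives, which is what “sends data with min priority” refers to on this slide.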

SLIDE 36

PASE Arbitration – Challenges

  • Challenges
    – Arbitration latency
    – Processing overhead
    – Network overhead

SLIDE 37

PASE Arbitration – Challenges

  • Challenges
    – Arbitration latency
    – Processing overhead
    – Network overhead

Solution: leverage the tree-like structure of typical DC topologies

SLIDE 38

Bottom-Up Arbitration

  • Leverage the tree structure: arbitrate from the leaves up to the root

SLIDE 39

Bottom-Up Arbitration

  • Leverage the tree structure: arbitrate from the leaves up to the root

[Figure: inter-rack path Sender → ToR → Aggregation → Core → Aggregation → ToR → Receiver]

SLIDE 40

Bottom-Up Arbitration

  • Leverage the tree structure: arbitrate from the leaves up to the root

[Figure: an Arbitration Message travels up the inter-rack path from the Sender]

SLIDE 41

Bottom-Up Arbitration

  • Leverage the tree structure: arbitrate from the leaves up to the root

[Figure: the Arbitration Message continues up the inter-rack path]

SLIDE 42

Bottom-Up Arbitration

  • Leverage the tree structure: arbitrate from the leaves up to the root

[Figure: the Arbitration Message goes up and a Receiver Response returns along the inter-rack path]

SLIDE 43

Bottom-Up Arbitration

  • Leverage the tree structure: arbitrate from the leaves up to the root

[Figure: intra-rack arbitration goes through the ToR only; inter-rack arbitration goes up to the core]

Intra-rack: no external arbitrators required!
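
The sketch below shows how bottom-up arbitration could compose the per-link decisions along a sender-to-receiver path; it reuses the illustrative arbitrate() from the earlier sketch, and the leaf-to-root list representation and min/max combining rules are assumptions for the example.

```python
# Illustrative bottom-up arbitration along one path. Each hop's arbitrator
# contributes (rate, queue) feedback; the sender keeps the most conservative
# values: the bottleneck rate and the lowest priority seen on the path.

def arbitrate_path(flow_id, links):
    """links: per-link flow tables {flow_id: remaining_bytes}, ordered from
    the sender's ToR up to the core. Returns the (rate, queue) feedback."""
    rate, queue = float("inf"), 0
    for flows in links:
        link_rate, link_queue = arbitrate(flows)[flow_id]
        rate = min(rate, link_rate)     # bottleneck link caps the rate
        queue = max(queue, link_queue)  # larger index = lower priority
    return rate, queue
```

For intra-rack traffic the list contains only the ToR’s access links, which is why no external arbitrators are required in that case.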

SLIDE 44

Bottom-Up Arbitration

  • Leverage the tree structure: arbitrate from the leaves up to the root
  • Facilitates inter-rack optimizations (early pruning & delegation) to reduce arbitration overhead

[Figure: intra-rack and inter-rack arbitration paths with Arbitration Message and Receiver Response]

Intra-rack: no external arbitrators required!

SLIDE 45

Early Pruning

  • Arbitration involves sorting flows and picking the top k for immediate scheduling
  • Flows that won’t make it into the top k queues should be pruned at lower levels

[Figure: ToR, Aggregation, and Core arbitrators, each forwarding only its top k flows]

SLIDE 46

Early Pruning

  • Arbitration involves sorting flows and picking the top k for immediate scheduling
  • Flows that won’t make it into the top k queues should be pruned at lower levels

[Figure: ToR, Aggregation, and Core arbitrators, each forwarding only its top k flows]

Reduces network and processing overhead: fewer flows contact the higher-level arbitrators!
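
A minimal sketch of early pruning, under the same assumptions as the earlier arbitrator sketches: each level sorts only the candidates its children forwarded, so higher-level arbitrators see at most k flows per child rather than every flow in the subtree.

```python
# Illustrative early pruning: forward only the top-k candidates upward.

K = 8  # assumption: flows worth scheduling immediately per link

def prune(flows, k=K):
    """flows: {flow_id: remaining_bytes} -> top-k most critical subset."""
    return {f: flows[f] for f in sorted(flows, key=flows.get)[:k]}

def arbitrate_level(child_flow_sets, k=K):
    """Combine the pruned candidates from all children, then prune again
    before forwarding to the next level up the tree."""
    survivors = {}
    for flows in child_flow_sets:
        survivors.update(prune(flows, k))
    return prune(survivors, k)
```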

SLIDE 47

Delegation

Key Idea: divide a link into virtual links and delegate responsibility to child arbitrators

[Figure: a link between Aggregation and Core, shared by several ToRs]

SLIDE 48

Delegation

Key Idea: divide a link into virtual links and delegate responsibility to child arbitrators

  • Algorithm
    – Link capacity C is split into N virtual links

[Figure: a link of capacity C between Aggregation and Core]

SLIDE 49

Delegation

Key Idea: divide a link into virtual links and delegate responsibility to child arbitrators

  • Algorithm
    – Link capacity C is split into N virtual links
    – The parent arbitrator delegates each virtual link to a child arbitrator

[Figure: capacity C divided into delegated capacities a1, a2, …, aN across the ToRs]

SLIDE 50

Delegation

Key Idea: divide a link into virtual links and delegate responsibility to child arbitrators

  • Algorithm
    – Link capacity C is split into N virtual links
    – The parent arbitrator delegates each virtual link to a child arbitrator
    – Each child arbitrator does arbitration for its virtual link

[Figure: delegated capacities a1, a2, …, aN across the ToRs]

SLIDE 51

Delegation

Key Idea: divide a link into virtual links and delegate responsibility to child arbitrators

  • Algorithm
    – Link capacity C is split into N virtual links
    – The parent arbitrator delegates each virtual link to a child arbitrator
    – Each child arbitrator does arbitration for its virtual link
    – Virtual link capacity is periodically updated based on the top k flows of all child arbitrators

[Figure: delegated capacities a1, a2, …, aN across the ToRs]

SLIDE 52

Delegation

Key Idea: divide a link into virtual links and delegate responsibility to child arbitrators

  • Algorithm
    – Link capacity C is split into N virtual links
    – The parent arbitrator delegates each virtual link to a child arbitrator
    – Each child arbitrator does arbitration for its virtual link
    – Virtual link capacity is periodically updated based on the top k flows of all child arbitrators

[Figure: delegated capacities a1, a2, …, aN across the ToRs]

Reduces arbitration latency: arbitration decisions are made close to the sources!

SLIDE 53

PASE Overview

[Figure: the Arbitrator sends feedback to the Sender]

  • Arbitration: Control plane
    – Calculate “reference rate” and “priority queue”

  • Self-Adjusting Endpoints: Guided rate control
    – Use arbitrator feedback as a pivot

  • In-network Prioritization: Existing priority queues

SLIDE 54

PASE Endhost Transport

  • Rate Control
  • Loss Recovery Mechanism

SLIDE 55

PASE Endhost Transport

  • Rate Control
    – Use reference rate and priority feedback from arbitrators
    – Use the reference rate as a pivot, and
    – Follow DCTCP control laws

  • Loss Recovery Mechanism

SLIDE 56

PASE Endhost Transport

  • Rate Control
    – Use reference rate and priority feedback from arbitrators
    – Use the reference rate as a pivot, and
    – Follow DCTCP control laws

  • Loss Recovery Mechanism
    – Packets in lower priority queues can be delayed for several RTTs
    – Use a large RTO or a small probe to avoid spurious retransmissions
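
A minimal sketch of this guided rate control follows, layered on the DCTCP-style sender from the earlier sketch; the pivot computation and the constants are illustrative assumptions rather than PASE’s exact control law.

```python
# Illustrative guided rate control: pivot the congestion window around the
# arbitrator's reference rate, then let DCTCP-style ECN feedback fine-tune.

RTT = 300e-6     # seconds (matches the simulation setup below)
MSS = 1460 * 8   # bits per packet (assumption)

def guided_window(reference_rate, alpha, marked):
    """reference_rate: bits/sec from the arbitrators; alpha: EWMA of the
    ECN-marked fraction; marked: whether this window saw any marks.
    Returns the next congestion window, in packets."""
    pivot = reference_rate * RTT / MSS  # window matching the reference rate
    if marked:
        # Congestion signaled: back off from the pivot, DCTCP-style.
        return max(1.0, pivot * (1 - alpha / 2))
    # No congestion: probe just above the pivot for spare capacity.
    return pivot + 1.0
```

Starting from the arbitrator’s pivot rather than from a blind search is what lets the endpoint converge quickly while still absorbing inaccuracies in the arbitration.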

SLIDE 57

PASE – Putting it Together

[Figure: Arbitrators along the path send feedback to the Sender]

  • Efficient arbitration control plane
  • Simple TCP-like transport
  • Existing priority queues inside switches

SLIDE 58

Rest of the Talk …

  • DC Transport Strategies
  • PASE Design
  • Evaluation

SLIDE 59

Evaluation

  • Platforms
    – Small-scale testbed
    – ns-2

  • Workloads
    – Web search (DCTCP), Data mining (VL2)

  • Comparison with deployment-friendly protocols
    – DCTCP, D2TCP, L2DCT

  • Comparison with the state of the art
    – pFabric

SLIDE 60

Simulation Setup

Parameter  | Value
-----------|-------------------
Queue Size | 250 KB (per queue)
RTT        | 300 µs
RTO        | 1 ms
L          | 40

SLIDE 61

Comparison with Deployment-Friendly Protocols

Settings similar to D2TCP:

  • Flow Sizes: 100–500 KB
  • Deadlines: 5–25 ms

SLIDE 62

Comparison with Deployment-Friendly Protocols

Settings similar to D2TCP:

  • Flow Sizes: 100–500 KB
  • Deadlines: 5–25 ms

PASE is deployment friendly yet performs BETTER than existing protocols!

SLIDE 63

Comparison with the State of the Art

Settings:

  • Flow Sizes: 2–98 KB
  • Left-to-right traffic

[Figure: 99th-percentile flow completion times]

SLIDE 64

Comparison with the State of the Art

Settings:

  • Flow Sizes: 2–98 KB
  • Left-to-right traffic

[Figure: 99th-percentile flow completion times]

PASE performs comparably and does not require changes to the data plane!

SLIDE 65

Summary

  • Key Strategies in Existing DC Transports
    – Arbitration, In-network Prioritization, Self-Adjusting Endpoints
    – Complementary rather than substitutes

  • PASE
    – Combines the three strategies
    – Efficient arbitration control plane; simple TCP-like transport; leverages existing priority queues inside switches

  • Performance
    – Comparable to or better than earlier proposals, including those that require changes to the network fabric

SLIDE 66

Thank you!