

SLIDE 1

Loom: Flexible and Efficient NIC Packet Scheduling

Brent Stephens, Aditya Akella, Mike Swift

NSDI 2019

SLIDE 2

Loom is a new Network Interface Card (NIC) design that offloads all per-flow scheduling decisions out of the OS and into the NIC.

  • Why is packet scheduling important?
  • What is wrong with current NICs?
  • Why should all packet scheduling be offloaded to the NIC?

SLIDE 3

Why is packet scheduling important?

SLIDE 4

Colocation (Application and Tenant) is Important for Infrastructure Efficiency

CPU Isolation Policy:
  • Tenant 1: Memcached: 3 cores, Spark: 1 core
  • Tenant 2: Spark: 4 cores

SLIDE 5

Network Performance Goals

Different applications have different network performance goals:

  • Low latency
  • High throughput

SLIDE 6

Network Policies

Network operators must specify and enforce a network isolation policy.

  • Enforcing a network isolation policy requires scheduling

Pseudocode:

Tenant_1.Memcached -> Pri_1:high
Tenant_1.Spark -> Pri_1:low
Pri_1 -> RL_WAN(Dst == WAN: 15Gbps)
Pri_1 -> RL_None(Dst != WAN: No Limit)
RL_WAN -> FIFO_1; RL_None -> FIFO_1
FIFO_1 -> Fair_1:w1
Tenant_2.Spark -> Fair_1:w1
Fair_1 -> Wire

[Figure: the corresponding policy DAG — VM traffic feeds Pri_1, which feeds the RL_WAN and RL_None rate limiters, which feed FIFO_1, which joins Tenant 2's traffic at Fair_1 before reaching the wire]

SLIDE 9

What is wrong with current NICs?

SLIDE 10

Single Queue Packet Scheduling Limitations

  • Single-core throughput is limited (although high with Eiffel), especially with very small packets
  • Energy-efficient architectures may prioritize scalability over single-core performance
  • Software scheduling consumes CPU
  • Core-to-core communication increases latency

[Figure: a single-queue (SQ) design — all applications feed one software queue before the NIC]

SQ struggles to drive line-rate!

SLIDE 11

Multi-Queue NIC Background and Limitations

  • Multi-queue NICs enable parallelism: throughput can be scaled across many tens of cores
  • Multi-queue NICs have a packet scheduler that chooses which queue to send packets from
  • The one-queue-per-core multi-queue model (MQ) attempts to enforce the policy at every core independently
  • This is the best possible without inter-core coordination, but it is not effective

[Figure: a multi-queue (MQ) design — each core has its own queue to the NIC]

MQ struggles to enforce policies!
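
To see why independent per-core enforcement breaks down, consider a toy model (my own illustration, not from the paper): the policy is an equal share per tenant, each core is locally fair, and the NIC simply round-robins across per-core queues. A tenant whose flows happen to land on more cores then gets more bandwidth:

```python
# Toy model: an MQ NIC services one queue per core, round-robin.
# Each core schedules its local traffic "fairly", but the global
# shares depend on how tenants are spread across cores.
queues = {
    "core0": ["A"],   # tenant A's flows pinned to core 0
    "core1": ["A"],   # ...and to core 1
    "core2": ["B"],   # tenant B only runs on core 2
}

sent = {"A": 0, "B": 0}
for _ in range(3000):                 # NIC round-robins across queues
    for core, tenants in queues.items():
        sent[tenants[0]] += 1         # one packet per queue per round

total = sum(sent.values())
for tenant, n in sent.items():
    print(f"tenant {tenant}: {100 * n / total:.0f}% of link")  # A: 67%, B: 33%
```

Even though every core is locally fair, tenant A receives twice tenant B's bandwidth, violating the equal-share policy.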

SLIDE 12

MQ Scheduler Problems

Naïve NIC packet scheduling prevents colocation! It leads to:

  • High latency
  • Unfair and variable throughput

[Figure: packets from competing applications interleaved over time by the NIC packet scheduler]

SLIDE 14

Why should all packet scheduling be offloaded to the NIC?

SLIDE 15

Where to divide labor between the OS and NIC?

Option 1: Single Queue (SQ)
  • Enforce the entire policy in software
  • Low throughput / high CPU utilization

Option 2: Multi Queue (MQ)
  • Every core independently enforces the policy on local traffic
  • Cannot ensure policies are enforced

Option 3: Loom
  • Every flow uses its own queue
  • All policy enforcement is offloaded to the NIC
  • Precise policy + low CPU

[Figure: the example policy DAG (Fair_1, Pri_1, FIFO_1, RL_WAN, RL_None) split between the CPU and the NIC under each option]

SLIDE 19

Loom is a new NIC design that moves all per-flow scheduling decisions out of the OS and into the NIC.

Loom uses a queue per flow and offloads all packet scheduling to the NIC.

SLIDE 20

Core Problem:

It is not currently possible to offload all packet scheduling because NIC packet schedulers are inflexible and configuring them is inefficient.

NIC packet schedulers are currently standing in the way of performance isolation!

SLIDE 22

Outline

Intro: Loom is a new NIC design that moves all per-flow scheduling decisions out of the OS and into the NIC

Contributions:
  • 1. Specification: A new network policy abstraction: restricted directed acyclic graphs (DAGs)
  • 2. Enforcement: A new programmable packet scheduling hierarchy designed for NICs
  • 3. Updating: A new expressive and efficient OS/NIC interface

Implementation and Evaluation: BESS prototype and CloudLab

SLIDE 24

What scheduling policies are needed for performance isolation? How should policies be specified?

SLIDE 25

Solution: Loom Policy DAG

Two types of nodes:

  • Scheduling nodes: Work-conserving policies for sharing the local link bandwidth
  • Shaping nodes: Rate-limiting policies for sharing the network core (WAN and DCN)

Programmability: Every node is programmable with a custom enqueue and dequeue function.

[Figure: the example policy DAG with shaping nodes (RL_WAN, RL_None) and scheduling nodes (Pri_1, FIFO_1, Fair_1) labeled]

Loom can express policies that cannot be expressed with either Linux Traffic Control (Qdisc) or with Domino (PIFO)! Important systems like BwE (sharing the WAN) and EyeQ (sharing the DCN) require Loom's policy DAG!
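
A minimal Python sketch of the two node types (class and method names are mine, not Loom's API; a software illustration, not the hardware design). A scheduling node is modeled as a PIFO, a priority queue that dequeues by smallest rank; a shaping node stamps packets with a wall-clock release time instead of ordering them:

```python
import heapq

class SchedulingNode:
    """Work-conserving node, modeled as a PIFO: dequeue returns the
    packet with the smallest rank. The rank chosen at enqueue time is
    what makes a node strict-priority, weighted-fair, FIFO, etc."""
    def __init__(self):
        self._pifo = []   # heap of (rank, seq, packet)
        self._seq = 0     # tie-breaker: FIFO among equal ranks

    def enqueue(self, pkt, rank):
        heapq.heappush(self._pifo, (rank, self._seq, pkt))
        self._seq += 1

    def dequeue(self):
        return heapq.heappop(self._pifo)[2] if self._pifo else None

class ShapingNode:
    """Rate-limiting node: instead of ordering packets, it assigns each
    one the wall-clock time at which it may be released."""
    def __init__(self, rate_bps):
        self.rate_bps = rate_bps
        self._next_free = 0.0

    def release_time(self, pkt_bytes, now):
        t = max(now, self._next_free)
        self._next_free = t + pkt_bytes * 8 / self.rate_bps
        return t

# Example: the Pri_1 node from the pseudocode as a strict-priority PIFO.
pri_1 = SchedulingNode()
pri_1.enqueue({"app": "Spark"}, rank=1)       # Tenant_1.Spark -> Pri_1:low
pri_1.enqueue({"app": "Memcached"}, rank=0)   # Tenant_1.Memcached -> Pri_1:high
print(pri_1.dequeue()["app"])                 # Memcached
```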

SLIDE 26

Types of Loom Scheduling Policies:

Scheduling:
  • All of the flows from competing Spark jobs J1 and J2 in VM1 fairly share network bandwidth (group by source)

Shaping:
  • All of the flows from VM1 to VM2 are rate limited to 50Gbps (group by destination)

Because scheduling and shaping policies may aggregate flows differently, they cannot be expressed as a tree!

SLIDE 29

Loom: Policy Abstraction

Policies are expressed as restricted directed acyclic graphs (DAGs).

[Figure: four example graphs (a)-(d) of shaping and scheduling nodes; (a) and (c) are allowed, (b) and (d) are not]

DAG restriction: Scheduling nodes form a tree when the shaping nodes are removed.

(b) and (d) are prevented because they allow parents to reorder packets that were already ordered by a child node.
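
The restriction can be checked mechanically. Below is a small sketch (helper names are mine; it checks only the single-parent condition, assuming the graph is already acyclic): splice shaping nodes out of the graph by following edges through them, then require that every scheduling node feeds at most one scheduling parent.

```python
def scheduling_parents(node, out_edges, is_shaping):
    """Scheduling nodes reachable from `node` through zero or more
    shaping nodes, i.e. its parents once shaping nodes are spliced out."""
    found, stack, seen = set(), list(out_edges.get(node, [])), set()
    while stack:
        p = stack.pop()
        if p in seen:
            continue
        seen.add(p)
        if is_shaping.get(p, False):
            stack.extend(out_edges.get(p, []))   # pass through shaping node
        else:
            found.add(p)
    return found

def satisfies_restriction(nodes, edges, is_shaping):
    """Loom's DAG restriction (in-degree part): with shaping nodes
    removed, every scheduling node has at most one scheduling parent."""
    out_edges = {}
    for child, parent in edges:   # packets flow child -> parent -> Wire
        out_edges.setdefault(child, []).append(parent)
    return all(
        len(scheduling_parents(n, out_edges, is_shaping)) <= 1
        for n in nodes
        if not is_shaping.get(n, False)
    )

# Case (b) from the slide: one child scheduled by two parents -> rejected.
print(satisfies_restriction(
    nodes=["Child", "P1", "P2"],
    edges=[("Child", "P1"), ("Child", "P2")],
    is_shaping={},
))  # False
```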

SLIDE 31

Outline

Contributions:
  • 1. Specification: A new network policy abstraction: restricted directed acyclic graphs (DAGs)
  • 2. Enforcement: A new programmable packet scheduling hierarchy designed for NICs
  • 3. Updating: A new expressive and efficient OS/NIC interface

SLIDE 32

How do we build a NIC that can enforce Loom's new DAG abstraction?

SLIDE 33

Loom Enforcement Challenge

No existing hardware scheduler can efficiently enforce Loom Policy DAGs.

                      Scheduling queues   Shaping queues
  Domino PIFO Block         1x                 1x
  New PIFO Block?           1x                 Nx

Requiring separate shaping queues for every shaping traffic class would be prohibitive!

SLIDE 34

Insight: All shaping can be done with a single queue because all shaping can use wall clock time as a rank

SLIDE 35

Loom Enforcement

In Loom, scheduling and shaping queues are separate:

1. All traffic is first placed only in scheduling queues
2. If a packet is dequeued before its shaping time, it is placed in a global shaping queue
3. After shaping, the packet is placed back in the scheduling queues

[Figure: flows F1-F3 moving between scheduling queues for the classes Mem (25Gbps rate limit), Mem (no limit), and Spark (no limit) under Pri, and the single global shaping queue]
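
A minimal software sketch of this three-step flow (my illustration, not Loom's hardware implementation). Scheduling queues form a PIFO ordered by rank; the single global shaping queue is ordered by wall-clock release time, which is exactly the insight from the previous slide:

```python
import heapq
import time

class LoomQueues:
    """Sketch: scheduling PIFO (min-heap on rank) plus one global
    shaping queue keyed by each packet's wall-clock release time."""

    def __init__(self):
        self._sched = []   # (rank, seq, pkt)
        self._shape = []   # (release_time, seq, pkt)
        self._seq = 0

    def enqueue(self, pkt, rank):
        # Step 1: all traffic first enters a scheduling queue.
        pkt["rank"] = rank
        heapq.heappush(self._sched, (rank, self._seq, pkt))
        self._seq += 1

    def dequeue(self, now=None):
        now = time.monotonic() if now is None else now
        # Step 3: packets whose shaping time has passed re-enter scheduling.
        while self._shape and self._shape[0][0] <= now:
            _, _, pkt = heapq.heappop(self._shape)
            self.enqueue(pkt, pkt["rank"])
        while self._sched:
            _, _, pkt = heapq.heappop(self._sched)
            if pkt.get("shaping_time", 0.0) > now:
                # Step 2: dequeued before its shaping time -> park it in
                # the single shaping queue (rank = wall-clock time).
                heapq.heappush(self._shape, (pkt["shaping_time"], self._seq, pkt))
                self._seq += 1
            else:
                return pkt
        return None

q = LoomQueues()
q.enqueue({"flow": "F1", "shaping_time": 5.0}, rank=0)  # rate-limited class
q.enqueue({"flow": "F2"}, rank=1)                       # unshaped class
print(q.dequeue(now=1.0)["flow"])   # F2 (F1 parked until t=5.0)
print(q.dequeue(now=6.0)["flow"])   # F1 (released after shaping)
```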

SLIDE 44

Outline

Contributions:
  • 1. Specification: A new network policy abstraction: restricted directed acyclic graphs (DAGs)
  • 2. Enforcement: A new programmable packet scheduling hierarchy designed for NICs
  • 3. Updating: A new expressive and efficient OS/NIC interface

SLIDE 45

PCIe Limitations

NIC doorbell and update limitations:¹

  • Latency: 120-900ns
  • Throughput: ~3Mops (Intel XL710 40Gbps)

Loom Goal: Less than 1Mops @ 100Gbps

[Figure: per-queue doorbells (DB1-DB4, DB_F) crossing PCIe between the app core and the NIC's PCIe engine]

¹ PSPAT: Software Packet Scheduling at Hardware Speed. Luigi Rizzo, Paolo Valente, Giuseppe Lettieri, Vincenzo Maffione. Univ. di Pisa and Univ. di Modena e Reggio Emilia.

SLIDE 50

Loom Efficient Interface Challenges

Insufficient data:
  • Before reading any packet data (headers), the NIC must schedule DMA reads for a queue

Too many PCIe writes:
  • In the worst case (every packet is from a new flow), the OS must generate 2 PCIe writes per packet
  • 2 writes per 1500B packet at 100Gbps = 16.6 Mops!
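
The 16.6 Mops figure is straightforward arithmetic; the short check below just re-derives it (Python prints 16.7 because the exact value is 16.67):

```python
LINE_RATE_BPS = 100e9    # 100 Gbps
PKT_BYTES = 1500
WRITES_PER_PKT = 2       # worst case per the slide: every packet is a new flow

pps = LINE_RATE_BPS / (PKT_BYTES * 8)   # ~8.33M packets per second
writes = WRITES_PER_PKT * pps           # ~16.7M PCIe writes per second
print(f"{pps / 1e6:.1f} Mpps -> {writes / 1e6:.1f} Mops of PCIe writes")
```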

SLIDE 51

Loom Design

Loom introduces a new efficient OS/NIC interface that reduces the number of PCIe writes through batched updates and inline metadata

SLIDE 52

Batched Doorbells

Using on-NIC Doorbell FIFOs allows for updates to different queues (flows) to be batched.

Per-core FIFOs still enable parallelism.

[Figure: each app core writes batched doorbells into a per-core Doorbell FIFO on the NIC over PCIe]
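
As a rough software analogy (names invented for the sketch; the real mechanism is a PCIe write into NIC memory), a per-core doorbell FIFO lets one write publish updates for many queues, instead of one doorbell write per queue:

```python
from collections import deque

class DoorbellFIFO:
    """Software analogy of a per-core on-NIC doorbell FIFO.

    Instead of one PCIe write per queue update, the driver appends
    (queue_id, tail_index) entries locally and flushes the whole batch
    with a single write; the NIC drains the entries in order."""
    def __init__(self):
        self.pending = []          # updates accumulated by the core
        self.fifo = deque()        # entries visible to the NIC

    def ring(self, queue_id, tail):
        self.pending.append((queue_id, tail))

    def flush(self):
        # One "PCIe write" publishes every pending update at once.
        self.fifo.extend(self.pending)
        n, self.pending = len(self.pending), []
        return n                   # updates delivered by this write

db = DoorbellFIFO()
for q in range(8):                 # eight new flows, eight queue updates
    db.ring(queue_id=q, tail=1)
print(db.flush(), "queue updates in a single doorbell write")
```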

SLIDE 59

Inline Metadata

Scheduling metadata (traffic class and scheduling updates) is inlined to reduce PCIe writes.

Descriptor inlining allows for scheduling before reading packet data.

[Figure: descriptors with inlined metadata flowing from memory queues (Q1-Q5, Q_F) through the DMA engine to the NIC's PIFOs and the wire]
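
To illustrate descriptor inlining (the field layout below is invented for the sketch, not Loom's actual descriptor format): because the scheduling inputs travel in the descriptor itself, the NIC can rank a packet in its PIFOs before DMA-reading any packet bytes.

```python
import heapq
from dataclasses import dataclass

@dataclass
class TxDescriptor:
    """Sketch: a TX descriptor carrying inlined scheduling metadata."""
    buf_addr: int         # host-memory address of the packet bytes
    buf_len: int
    traffic_class: int    # inlined: policy-DAG class for this packet

pifo = []                 # NIC-side PIFO: (rank, seq, descriptor)
seq = 0

def nic_enqueue(desc):
    """Rank the packet from inlined fields alone -- no DMA read of the
    packet data is needed before scheduling (here: rank = class)."""
    global seq
    heapq.heappush(pifo, (desc.traffic_class, seq, desc))
    seq += 1

nic_enqueue(TxDescriptor(buf_addr=0x1000, buf_len=1500, traffic_class=2))
nic_enqueue(TxDescriptor(buf_addr=0x2000, buf_len=64, traffic_class=0))
print(heapq.heappop(pifo)[2].traffic_class)   # 0: scheduled before any DMA
```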

SLIDE 62

Outline

Contributions:
  • 1. A new network policy abstraction: restricted directed acyclic graphs (DAGs)
  • 2. A new programmable packet scheduling hierarchy designed for NICs
  • 3. A new expressive and efficient OS/NIC interface

Evaluation:
  • Implementation and Evaluation: BESS prototype and CloudLab

SLIDE 63

Loom Implementation

Software prototype of Loom in Linux on the Berkeley Extensible Software Switch (BESS)¹

http://github.com/bestephe/loom

The C++ PIFO² implementation is used for scheduling; 10Gbps and 40Gbps CloudLab evaluation.

² Programmable Packet Scheduling at Line Rate. Anirudh Sivaraman, Suvinay Subramanian, Mohammad Alizadeh, Sharad Chole, Shang-Tse Chuang, Anurag Agrawal, Hari Balakrishnan, Tom Edsall, Sachin Katti, Nick McKeown. MIT CSAIL, Barefoot Networks, Cisco Systems, Stanford University.

SLIDE 64

Loom Evaluation

Can Loom drive line rate? Can Loom enforce network policies?
  • Experiment: Microbenchmarks with iPerf

Can Loom isolate real applications?
  • Experiment: CloudLab experiments with memcached and Spark

How effective is Loom's efficient OS/NIC interface?
  • Experiment: Analysis of PCIe writes in Linux (QPF) versus Loom

SLIDE 65

Loom 40Gbps Evaluation

Setup:
  • Every 2s a new tenant starts or stops
  • Each tenant i starts 4^i flows (4-256 total flows)

Policy: All tenants (T1-T4) should receive an equal share (Fair).

[Plots: per-tenant throughput (Gbps) vs. time (s) for T1-T4 under SQ, MQ, and Loom]

Loom can drive line-rate and isolate competing tenants and flows.

SLIDE 67

Application Performance: Fairness

Spark vs. Spark: two bandwidth-hungry jobs.

Policy: Bandwidth is fairly shared between Spark jobs (Fair).

[Plots: throughput (Gbps) of Job1 and Job2 over time (s) under Linux and Loom]

Loom can ensure competing jobs share bandwidth even if they have different numbers of flows.

SLIDE 69

Application Performance: Latency

Memcached (latency sensitive) vs. Spark (bandwidth hungry).

Setup: Linux software packet scheduling (Qdisc) is configured to prioritize memcached traffic over Spark traffic (Pri).

[Plot: 90th percentile latency (us) for Loom vs. Linux (MQ)]

MQ cannot isolate latency-sensitive applications!

slide-71
SLIDE 71

Loom Interface Evaluation

111

Line-rate Existing approaches: PCIe Writes per second Loom: PCIe Writes per second 10 Gbps 833K 19K 40 Gbps 3.3M 76K 100 Gbps 8.3M 191K

Worse case scenario: Packets are sent in 64KB batches and each packet is from a different flow

slide-72
SLIDE 72

Loom Interface Evaluation

112

Line-rate Existing approaches: PCIe Writes per second Loom: PCIe Writes per second 10 Gbps 833K 19K 40 Gbps 3.3M 76K 100 Gbps 8.3M 191K

Worse case scenario: Packets are sent in 64KB batches and each packet is from a different flow

Loom Goal: Less than 1Mops @ 100Gbps
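
These figures are consistent with one PCIe write per 1500B packet for existing approaches versus one write per 64KB batch for Loom; the quick check below is my reconstruction of that arithmetic, not taken from the paper:

```python
def writes_per_sec(rate_gbps, bytes_per_write):
    # One doorbell write per `bytes_per_write` bytes on the wire.
    return rate_gbps * 1e9 / (bytes_per_write * 8)

for gbps in (10, 40, 100):
    existing = writes_per_sec(gbps, 1500)       # per-packet doorbells
    loom = writes_per_sec(gbps, 64 * 1024)      # one doorbell per 64KB batch
    print(f"{gbps:>3} Gbps: existing ~{existing/1e3:.0f}K/s, Loom ~{loom/1e3:.0f}K/s")
```

This reproduces the table (833K/19K at 10 Gbps, 3333K/76K at 40 Gbps, 8333K/191K at 100 Gbps), putting Loom well under the 1Mops goal.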

SLIDE 73

Conclusion

  • Current NICs cannot ensure that competing applications are isolated
  • Loom is a new NIC design that completely offloads all packet scheduling to the NIC with low CPU overhead
  • Loom's benefits translate into reductions in latency, increases in throughput, and improvements in fairness

SLIDE 74

Related Work (Eiffel)

  • Eiffel: NIC scheduling does not eliminate the need for software scheduling
  • Loom and Eiffel can be used together
  • Bucketed priority queues could be used to build efficient PIFOs