Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks - PowerPoint PPT Presentation



SLIDE 1

Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks

Alireza Farshin*, Amir Roozbeh*+, Gerald Q. Maguire Jr.*, Dejan Kostić

*KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science (EECS) + Ericsson Research

SLIDE 2

Traditional I/O

2020-07-02

I/O Device

  • 1. I/O device DMAs* packets to main memory
  • 2. CPU later fetches them to cache

* Direct Memory Access (DMA)
SLIDE 3

Traditional I/O

I/O Device

  • 1. I/O device DMAs* packets to main memory
  • 2. CPU later fetches them to cache

Inefficient:

  • Large number of accesses to main memory
  • High access latency (>60 ns)
  • Unnecessary memory bandwidth usage

* Direct Memory Access (DMA)

SLIDE 4

Direct Cache Access (DCA)

I/O Device

  • 1. I/O device DMAs packets to main memory
  • 2. DCA exploits TPH* to prefetch a portion of packets into cache
  • 3. CPU later fetches them from cache

* PCIe TLP Processing Hints (TPH)

SLIDE 5

Direct Cache Access (DCA)

I/O Device

  • 1. I/O device DMAs packets to main memory
  • 2. DCA exploits TPH* to prefetch a portion of packets into cache
  • 3. CPU later fetches them from cache

  • Still inefficient in terms of memory bandwidth usage
  • Requires OS intervention and support from the processor

* PCIe TLP Processing Hints (TPH)

SLIDE 6

Intel Data Direct I/O (DDIO)

I/O Device

  • DDIO in Xeon processors since Xeon E5
  • DMA packets or descriptors directly to/from the Last Level Cache (LLC)

SLIDE 7

Trends

More in-network computing + offloading capabilities: push costly calculations into the network and perform stateful functions at the processor, which makes applications more I/O intensive.

SLIDE 8

Pressure from these trends

More in-network computing + offloading capabilities. Faster link speeds.

Multi-hundred-gigabit networks cannot tolerate memory access latency, and the interarrival time of packets continues to shrink.

Every 6.72 ns a new (64-B + 20-B*) packet arrives at 100 Gbps.

* 7-B preamble + 1-B start-of-frame delimiter + 12-B inter-frame gap = 20 B
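The 6.72 ns figure follows directly from the minimum frame size plus the Ethernet framing overhead listed in the footnote; a quick sanity check (Python sketch, assuming only the stated 64 B + 20 B on the wire):

```python
# Interarrival time of minimum-size Ethernet frames at 100 Gbps.
frame = 64              # minimum Ethernet frame, bytes
overhead = 7 + 1 + 12   # preamble + SFD + inter-frame gap, bytes
link_bps = 100e9        # 100 Gbps line rate

wire_bits = (frame + overhead) * 8           # 672 bits on the wire
interarrival_ns = wire_bits / link_bps * 1e9
print(f"{interarrival_ns:.2f} ns")           # → 6.72 ns
```

At 200 Gbps the budget halves to 3.36 ns, far below a single ~60 ns main-memory access.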

SLIDE 9

DCA matters because

Without DCA we are unable to process I/O at line rate, thus increasing packet loss or latency when utilizing multi-hundred-gigabit networks.

SLIDE 10

Forwarding Packets at 100 Gbps

Packet Generator → Device under Test (forwarding packets) at 100 Gbps

Intel Xeon Gold 6140, Mellanox ConnectX-5

Each NIC is placed in a PCIe 3.0 x16 slot*

[Figure: 99th percentile latency (µs) vs. rate (100 Gbps, 200 Gbps)]

* A PCIe 3.0 x16 slot is capable of providing ~125 Gbps effective full-duplex bandwidth.
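The ~125 Gbps footnote can be checked from PCIe 3.0 parameters (a back-of-envelope sketch; it ignores TLP/DLLP protocol overhead, which pushes the usable figure slightly below the raw line rate):

```python
# Approximate effective bandwidth of a PCIe 3.0 x16 slot.
lanes = 16
raw_gts = 8e9          # PCIe 3.0 signaling rate per lane, transfers/s
encoding = 128 / 130   # 128b/130b line coding efficiency

effective_gbps = lanes * raw_gts * encoding / 1e9
print(f"{effective_gbps:.1f} Gbps")   # ≈ 126 Gbps before packet overhead
```

So a single x16 slot comfortably carries one 100 Gbps NIC, but cannot carry two; hence the two-slot setup on the next slide.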

SLIDE 11

What happens at 200 Gbps?

Packet Generator → Device under Test (forwarding packets) at 2x100 Gbps

Intel Xeon Gold 6140, Mellanox ConnectX-5

Each NIC is placed in a PCIe 3.0 x16 slot*

[Figure: 99th percentile latency (µs) of the first NIC, when forwarding at the indicated aggregate rate (100 Gbps vs. 200 Gbps)]

When forwarding at 200 Gbps, 30% higher latency for the NIC forwarding at 100 Gbps

* A PCIe 3.0 x16 slot is capable of providing ~125 Gbps effective full-duplex bandwidth.

SLIDE 12

How does DDIO work?

Writing packets/descriptors:

[Figure: CPU socket with cores sharing a logical LLC; sending/receiving packets via DDIO; write to the same cache line already present in LLC]

DDIO overwrites a cache line if it is already present in any LLC way (≡ write update or hit)

SLIDE 13

How does DDIO work?

Writing packets/descriptors:

[Figure: CPU socket with cores sharing a logical LLC; sending/receiving packets via DDIO; line not present in LLC, so a cache line is allocated]

DDIO overwrites a cache line if it is already present in any LLC way (≡ write update or hit)

Otherwise, DDIO allocates a cache line in a limited portion of the LLC (≡ write allocate or miss)

SLIDE 14

How does DDIO work?

Writing packets/descriptors:

[Figure: CPU socket with cores sharing a logical LLC; sending/receiving packets via DDIO]

DDIO overwrites a cache line if it is already present in any LLC way (≡ write update or hit)

Otherwise, DDIO allocates a cache line in a limited portion of the LLC (≡ write allocate or miss)

Reading packets/descriptors:

NIC reads a cache line if it is already present in any LLC way (≡ read hit). Otherwise, NIC reads it from main memory (≡ read miss).
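The write rules above can be captured in a toy model (a simplified Python sketch, not Intel's implementation: one cache set, FIFO replacement within the DDIO portion, 2 of 11 ways as on the CPUs discussed later):

```python
from collections import deque

N_WAYS, DDIO_WAYS = 11, 2   # whole set vs. the limited DDIO portion

class LLCSet:
    def __init__(self):
        self.resident = set()     # lines present anywhere in the set
        self.ddio_fifo = deque()  # lines DDIO allocated (at most DDIO_WAYS)

    def ddio_write(self, line):
        if line in self.resident:                # write update (hit):
            return "hit"                         # overwrite in place, no eviction
        if len(self.ddio_fifo) == DDIO_WAYS:     # DDIO portion full:
            evicted = self.ddio_fifo.popleft()   # evict an earlier I/O line
            self.resident.discard(evicted)
        self.ddio_fifo.append(line)              # write allocate (miss)
        self.resident.add(line)
        return "miss"

s = LLCSet()
print(s.ddio_write("pkt0"))  # miss (allocate)
print(s.ddio_write("pkt1"))  # miss
print(s.ddio_write("pkt0"))  # hit (update in place)
print(s.ddio_write("pkt2"))  # miss: evicts pkt0, possibly not yet processed
```

The last line previews the problem on the later slides: once I/O traffic exceeds the DDIO portion, write allocates start evicting packets that the CPU has not consumed yet.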

SLIDE 15

How does DDIO work?

[Figure: CPU socket with cores sharing a logical LLC; sending/receiving packets via DDIO]

Designed a set of micro-benchmarks to learn about DDIO:

  • Which ways are used for allocation?
  • How does DDIO interact with other applications?
  • Does DMA via a remote CPU socket pollute the LLC?

SLIDE 16

LLC ways used by DDIO

[Figure: logical LLC ways 1-11; core C0 runs the I/O application; use CAT* to limit its code/data]

Sending/Receiving Packets via DDIO

* Cache Allocation Technology

SLIDE 17

LLC ways used by DDIO

[Figure: logical LLC ways 1-11; C0 runs the I/O application, C1 the cache-sensitive application+; use CAT* to limit code/data]

Sending/Receiving Packets via DDIO

* Cache Allocation Technology
+ water_nsquared from the Splash-3 benchmark

SLIDE 18

LLC ways used by DDIO

[Figure: logical LLC ways 1-11; C0 runs the I/O application, C1 the cache-sensitive application+. Plot: sum of cache misses (million) vs. ways allocated by CAT to the cache-sensitive application (1,2 ... 10,11)]

Sending/Receiving Packets via DDIO

* Cache Allocation Technology
+ water_nsquared from the Splash-3 benchmark

SLIDE 19-21: (animation frames repeating SLIDE 18, sliding the cache-sensitive application's two CAT ways across the LLC)

SLIDE 22

LLC ways used by DDIO

[Figure: as before; plot of sum of cache misses (million) vs. ways allocated by CAT to the cache-sensitive application]

Sending/Receiving Packets via DDIO

Contention with code/data causes a rise in the cache misses of the I/O application

* Cache Allocation Technology
+ water_nsquared from the Splash-3 benchmark

SLIDE 23-26: (animation frames repeating SLIDE 18, continuing to slide the CAT ways toward the top of the LLC)

SLIDE 27

LLC ways used by DDIO

[Figure: as before; plot of sum of cache misses (million) vs. ways allocated by CAT to the cache-sensitive application]

Sending/Receiving Packets via DDIO

Contention with I/O causes a rise in the cache misses of the I/O application

* Cache Allocation Technology
+ water_nsquared from the Splash-3 benchmark

SLIDE 28

LLC ways used by DDIO

[Figure: as before, with the DDIO ways highlighted; plot of sum of cache misses (million) vs. ways allocated by CAT to the cache-sensitive application]

Sending/Receiving Packets via DDIO

Contention with I/O causes a rise in the cache misses of the I/O application

See our paper

* Cache Allocation Technology
+ water_nsquared from the Splash-3 benchmark
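One way to see why the topmost ways are the contended ones: sliding the cache-sensitive application's two CAT ways across the LLC eventually overlaps the ways DDIO allocates into (a bitmask sketch; it assumes the 11-way LLC and the 0x600 default mask discussed on a later slide):

```python
DDIO_MASK = 0x600   # default IIO LLC WAYS value: the top two of 11 ways

def pair_mask(first_bit):
    """CAT bitmask selecting two consecutive LLC ways."""
    return 0b11 << first_bit

for first in range(10):   # slide the pair across ways 1,2 ... 10,11
    overlap = pair_mask(first) & DDIO_MASK
    print(f"ways {first + 1},{first + 2}: "
          f"{'contends with DDIO' if overlap else 'no DDIO overlap'}")
```

Only the last two positions (ways 9,10 and 10,11) intersect the DDIO mask, matching the rise in I/O cache misses seen at the right edge of the plot.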

SLIDE 29

How does DDIO perform?

DDIO cannot provide the expected benefits!

  • ResQ* [NSDI'18]
  • Intel reports

Write-allocate DDIO could evict not-yet-processed and already-processed packets from the LLC; such packets must then be read from main memory rather than the LLC.

Reduce the number of RX descriptors so that the buffers fit in the limited DDIO portion.

* ResQ: Enabling SLOs in Network Function Virtualization

SLIDE 30

Reducing #Descriptors is Not Sufficient! (1/2)

Increasing the number of RX descriptors and packet size adversely affects the performance of DDIO.

DDIO cannot use the whole reserved capacity in the LLC: 375 KB ≪ 4.5 MB*

* DDIO uses 2 ways out of 11 ways, i.e., 24.75 MB x 2 / 11 = 4.5 MB

SLIDE 31

Reducing #Descriptors is Not Sufficient! (2/2)

DDIO should be able to perform well with a high number of RX descriptors!

[Figure: DDIO write-hit rate (%) vs. number of cores (2-18)]

Increasing the number of cores does not always improve PCIe metrics for an I/O intensive application.

Forwarding 1500-B packets at 100 Gbps with 256 per-core RX descriptors: 1500 B x 256 x 18 ≈ 6.59 MB >> 4.5 MB
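The two capacity figures above come straight from the slide's arithmetic; a sketch reproducing them (assuming the Xeon Gold 6140's 24.75 MB LLC and DDIO's default 2-of-11 ways):

```python
# DDIO-reachable LLC capacity vs. the RX buffer working set.
llc_mb = 24.75                    # Xeon Gold 6140 LLC size
ddio_mb = llc_mb * 2 / 11         # 2 of 11 ways → 4.5 MB for DDIO allocations

pkt_bytes, descs_per_core, cores = 1500, 256, 18
working_set_mib = pkt_bytes * descs_per_core * cores / 2**20

print(f"DDIO portion: {ddio_mb:.1f} MB, RX buffers: {working_set_mib:.2f} MiB")
# With 18 cores the buffers outgrow the DDIO portion even at 256 descriptors.
```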

SLIDE 32

Tuning a little-discussed register can improve the performance of DDIO

[Figure: logical LLC with the IIO LLC WAYS bitmask 1 1 0 0 0 0 0 0 0 0 0 mapped onto the ways]

IIO LLC WAYS Register

Default value is 0x600

Increasing the number of bits set improves DDIO hit rates.
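The register is a bitmask over LLC ways, so each additional set bit grows the portion of the LLC that DDIO write-allocates may use. A sketch of the resulting capacities (only the 0x600 default is stated above; the 4-, 6-, and 8-bit mask values here are illustrative choices that extend the default contiguously downward):

```python
# Capacity reachable by DDIO for a given IIO LLC WAYS bitmask,
# on a 24.75 MB, 11-way LLC.
LLC_MB, N_WAYS = 24.75, 11

def ddio_capacity_mb(mask):
    ways = bin(mask).count("1")
    return LLC_MB * ways / N_WAYS

for bits, mask in [(2, 0x600), (4, 0x780), (6, 0x7E0), (8, 0x7F8)]:
    print(f"{bits} bits set (mask {mask:#x}): {ddio_capacity_mb(mask):.1f} MB")
```

With 8 bits set, DDIO can reach 18 MB of LLC, which comfortably holds the 6.59 MiB RX working set from the previous slide.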

SLIDE 33

Impact of Tuning DDIO

DDIO's effect on hit rates can affect application-level performance, depending on an application's characteristics.

[Figure: 99th percentile latency (µs) vs. number of RX descriptors (512-4096), for 2, 4, 6, and 8 bits set in IIO LLC WAYS]

For example, an I/O intensive application: 2 cores forwarding 1500-B packets at 100 Gbps

SLIDE 34

Impact of Tuning DDIO

DDIO's effect on hit rates can affect application-level performance, depending on an application's characteristics.

[Figure: 99th percentile latency (µs) vs. number of RX descriptors (512-4096), for 2, 4, 6, and 8 bits set in IIO LLC WAYS]

For example, an I/O intensive application: 2 cores forwarding 1500-B packets at 100 Gbps

Setting more bits reduces tail latency (by up to 30%)

SLIDE 35

Is Tuning DDIO Enough?

Tuning is not a perfect solution, due to:

  • Cache being used for code/data,
  • Smaller per-core cache quota, and
  • Coarse-grained partitions.

Next-generation DCA should provide:

  • Fine-grained placement: similar to CacheDirector* [EuroSys'19]
  • I/O isolation: extend CAT+ and CDP++ to include I/O
  • Selective DCA/DMA: only transfer relevant parts of the packet to the LLC

* Make the Most out of Last Level Cache in Intel Processors
+ Cache Allocation Technology
++ Code/Data Prioritization

SLIDE 36

What about Current Systems?

DMA should not be directed to the cache if this would cause I/O evictions!

  • Disabling DDIO for a specific PCIe port
  • Exploiting a remote socket

Bypassing the cache is beneficial in a multi-tenant/multi-application environment, where some performance isolation is desired.

SLIDE 37

Using Our Knowledge for 200 Gbps

Device under Test: forwarding packets at 2x100 Gbps, with the NICs on sockets connected via UPI

[Figure: 99th percentile latency (µs) of the first NIC versus aggregate rate: 100 Gbps; 200 Gbps; 200 Gbps (4 bits); 200 Gbps (remote socket); 200 Gbps (disable)]

Tuning DDIO improves packet processing at 200 Gbps

Better cache management is necessary for multi-hundred-gigabit-per-second networks

SLIDE 38

Other Insights

We study the performance of DDIO in different scenarios. See our paper for more results about:

  • How does receiving rate affect DDIO performance?
  • How does processing time affect DDIO performance?
  • Is DDIO always beneficial?
  • Scaling up and DDIO.

SLIDE 39

Our Key Findings (1/2)

  • If an application is I/O bound, adding excessive cores could degrade its performance.
  • If an application is I/O bound, tuning a little-discussed register called IIO LLC WAYS could improve performance and lead to the same improvements as adding more cores.
  • If an application starts to become CPU bound, adding more cores could improve its throughput, but it is important to balance load among cores to maximize DDIO's benefits.
  • Getting close to ~100 Gbps can cause DDIO to become a bottleneck. Therefore, it is essential to know when to bypass the cache to realize performance isolation.

SLIDE 40

Our Key Findings (2/2)

  • If an application is truly CPU/memory bound, tuning DDIO is less efficient.

We now explain the impact of processing time on the performance of DDIO, which resulted in this finding.
SLIDE 41

Impact of Processing Time

Device under Test: input packet → swap MAC addresses, then call a random number generator (std::mt19937) a variable number of times → output packet

[Figure: DDIO write/read hit rates (%) and throughput vs. number of RNG calls (10-100)]

Increasing processing time improves DDIO performance
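The per-packet work in this experiment can be sketched as follows (a hypothetical Python mock-up for illustration; the actual device under test is a C++/DPDK forwarder using std::mt19937):

```python
import random

def process(pkt: bytearray, rng_calls: int) -> bytearray:
    """Swap dst/src MAC addresses, then burn CPU time with RNG draws."""
    pkt[0:6], pkt[6:12] = pkt[6:12], pkt[0:6]  # first 12 B of an Ethernet header
    rng = random.Random(42)                    # stand-in for std::mt19937
    for _ in range(rng_calls):                 # knob controlling processing time
        rng.random()
    return pkt

frame = bytearray(range(14)) + bytearray(50)   # toy Ethernet frame
out = process(frame, rng_calls=50)
print(out[:6].hex(), out[6:12].hex())          # MAC fields are now swapped
```

Raising `rng_calls` lengthens the time between packet arrivals at the LLC, giving DDIO's limited portion more room to absorb I/O writes, which is why hit rates improve with processing time.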

SLIDE 42

Impact of Processing Time

Device under Test: input packet → swap MAC addresses, then call a random number generator (std::mt19937) a variable number of times → output packet

[Figure: DDIO write/read hit rates (%) and throughput (Gbps) vs. number of RNG calls (10-100)]

Increasing processing time reduces throughput

DDIO performance matters most when an application is I/O bound, rather than CPU/memory bound.

slide-43
SLIDE 43
  • DCA/DDIO should be tuned for I/O intensive

applications.

  • DCA/DDIO needs to be rearchitected for

multi-hundred-gigabit networks.

  • Benchmark your testbed with our source

code.

Conclusion

2020-07-02 43

https://github.com/aliireza/ddio-bench

This work is supported by ERC, SSF , and WASP .

SLIDE 44

Thanks for listening

Do not hesitate to contact us if you have any questions: farshin@kth.se and amirrsk@kth.se