Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks


  1. Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks. Alireza Farshin*, Amir Roozbeh*+, Gerald Q. Maguire Jr.*, Dejan Kostić*. (* KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science (EECS); + Ericsson Research)

  2. Traditional I/O: (1) The I/O device DMAs* packets to main memory. (2) The CPU later fetches them into cache. (* Direct Memory Access (DMA).) (Slide footer date: 2020-07-02.)

  3. Traditional I/O: (1) The I/O device DMAs* packets to main memory. (2) The CPU later fetches them into cache. This is inefficient: a large number of accesses to main memory, high access latency (>60 ns), and unnecessary memory bandwidth usage. (* Direct Memory Access (DMA).)

  4. Direct Cache Access (DCA): (1) The I/O device DMAs packets to main memory. (2) DCA exploits TPH* to prefetch a portion of the packets into cache. (3) The CPU later fetches them from cache. (* PCIe Transaction protocol Processing Hint (TPH).)

  5. Direct Cache Access (DCA): (1) The I/O device DMAs packets to main memory. (2) DCA exploits TPH* to prefetch a portion of the packets into cache. (3) The CPU later fetches them from cache. DCA is still inefficient in terms of memory bandwidth usage, and it requires OS intervention and support from the processor. (* PCIe Transaction protocol Processing Hint (TPH).)

  6. Intel Data Direct I/O (DDIO): available in Xeon processors since the Xeon E5. DDIO DMAs packets or descriptors directly to/from the Last Level Cache (LLC).

  7. Trends: more in-network computing and offloading capabilities. Push costly calculations into the network and perform stateful functions at the processor, which makes applications more I/O intensive.

  8. Pressure from these trends: more in-network computing and offloading capabilities, plus faster link speeds. At 100 Gbps, a new (64-B + 20-B*) packet arrives every 6.72 ns. Multi-hundred-gigabit networks cannot tolerate memory accesses, and the interarrival time of packets continues to shrink. (* 7 B preamble + 1 B start-of-frame delimiter + 12 B inter-frame gap = 20 B.)
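The 6.72 ns figure follows directly from the frame size and the link rate. The sketch below reproduces that arithmetic; the function name and the extra 200 and 400 Gbps rates are illustrative assumptions, and the 60 ns value is the lower bound on memory latency quoted on slide 3.

```python
def interarrival_ns(frame_bytes: int, link_gbps: float, overhead_bytes: int = 20) -> float:
    """Time between back-to-back frames on the wire, in nanoseconds."""
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    # 1 Gbit/s equals 1 bit per nanosecond, so bits / Gbps yields nanoseconds.
    return bits_on_wire / link_gbps


MEMORY_LATENCY_NS = 60  # lower bound quoted on the "Traditional I/O" slide

for rate in (100, 200, 400):  # 200 and 400 Gbps are illustrative extra rates
    gap = interarrival_ns(64, rate)
    print(f"{rate} Gbps: {gap:.2f} ns between packets, "
          f"{MEMORY_LATENCY_NS / gap:.1f} packets arrive per memory access")
```

At 100 Gbps, roughly nine minimum-size packets arrive in the time of a single 60 ns memory access, which is why the slide argues that main-memory accesses cannot be tolerated on the fast path.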

  9. DCA matters because, without it, we are unable to process I/O at line rate, which increases packet loss or latency when utilizing multi-hundred-gigabit networks.

  10. Forwarding Packets at 100 Gbps. Setup: a packet generator connected over a 100 Gbps link to a device under test that forwards the packets (Intel Xeon Gold 6140, Mellanox ConnectX-5). Each NIC is placed in a PCIe 3.0 16x slot*. [Figure: 99th percentile latency (µs, y-axis 0 to 1400) vs. rate (100 Gbps, 200 Gbps).] (* A PCIe 3.0 16x slot is capable of providing ~125 Gbps effective full-duplex bandwidth.)

  11. What happens at 200 Gbps? When forwarding at 200 Gbps (2x100 Gbps), latency is 30% higher for the NIC forwarding at 100 Gbps. Setup: a packet generator connected over two 100 Gbps links to a device under test that forwards the packets (Intel Xeon Gold 6140, Mellanox ConnectX-5). Each NIC is placed in a PCIe 3.0 16x slot*. [Figure: 99th percentile latency (µs, y-axis 0 to 1400) of the first NIC when forwarding at the indicated aggregate rate (100 Gbps, 200 Gbps); the 200 Gbps bar is annotated with the 30% increase.] (* A PCIe 3.0 16x slot is capable of providing ~125 Gbps effective full-duplex bandwidth.)
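The ~125 Gbps figure in the PCIe footnote can be roughly reproduced from the PCIe 3.0 signaling rate. This is a back-of-the-envelope sketch, assuming 8 GT/s per lane and 128b/130b line encoding (standard PCIe 3.0 parameters that the slide does not state) and ignoring transaction-layer and link-layer packet overhead; the function name is illustrative.

```python
def pcie3_gbps_per_direction(lanes: int = 16) -> float:
    """Approximate PCIe 3.0 payload-capable bandwidth per direction, in Gbps."""
    gt_per_s_per_lane = 8.0        # PCIe 3.0 raw signaling rate per lane
    encoding_efficiency = 128 / 130  # 128b/130b line encoding
    return gt_per_s_per_lane * encoding_efficiency * lanes


bw = pcie3_gbps_per_direction()
print(f"PCIe 3.0 x16: about {bw:.1f} Gbps per direction before packet overhead")
print("fits one 100 Gbps port:", bw >= 100)
print("fits two 100 Gbps ports:", bw >= 200)
```

The estimate lands slightly above the slide's ~125 Gbps, as expected when protocol overhead is ignored. A single slot cannot carry 200 Gbps, which is consistent with placing each NIC in its own slot.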

  12. How does DDIO work? Writing packets/descriptors: DDIO overwrites a cache line if it is already present in any LLC way (≡ write update or hit). [Diagram: CPU socket with cores (C) sharing a logical LLC; packets are sent/received via DDIO; the line is already present in the LLC, so DDIO writes to the same cache line.]

  13. How does DDIO work? Writing packets/descriptors: DDIO overwrites a cache line if it is already present in any LLC way (≡ write update or hit). Otherwise, DDIO allocates a cache line in a limited portion of the LLC (≡ write allocate or miss). [Diagram: CPU socket with cores (C) sharing a logical LLC; packets are sent/received via DDIO; the line is not present in the LLC, so DDIO allocates a cache line.]

  14. How does DDIO work? Writing packets/descriptors: DDIO overwrites a cache line if it is already present in any LLC way (≡ write update or hit). Otherwise, DDIO allocates a cache line in a limited portion of the LLC (≡ write allocate or miss). Reading packets/descriptors: the NIC reads a cache line if it is already present in any LLC way (≡ read hit). Otherwise, the NIC reads it from main memory (≡ read miss). [Diagram: CPU socket with cores (C) sharing a logical LLC; packets are sent/received via DDIO.]
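The four cases above (write update, write allocate, read hit, read miss) can be sketched as a toy model of a single LLC set. This is an illustration of the rules on the slides, not Intel's implementation: the class and method names are hypothetical, the 11 ways follow the way numbering on the later slides, and both the choice of two DDIO ways (indices 9 and 10) and the round-robin victim selection are assumptions.

```python
class ToyLLCSet:
    """One set of a toy LLC; illustrative only, not Intel's implementation."""

    def __init__(self, num_ways=11, ddio_ways=(9, 10)):
        self.lines = [None] * num_ways    # lines[w] = address cached in way w
        self.ddio_ways = list(ddio_ways)  # ways DDIO may allocate into (assumed)
        self.next_victim = 0              # round-robin pointer (assumed policy)

    def cpu_fill(self, addr, way):
        """Model a core bringing `addr` into a specific way."""
        self.lines[way] = addr

    def ddio_write(self, addr):
        """Model the I/O device writing a packet or descriptor line."""
        if addr in self.lines:
            return "write update"         # overwrite in place, in whichever way holds it
        way = self.ddio_ways[self.next_victim]
        self.next_victim = (self.next_victim + 1) % len(self.ddio_ways)
        self.lines[way] = addr            # evicts whatever that DDIO way held
        return "write allocate"

    def nic_read(self, addr):
        """Model the NIC reading a line; a miss is served from main memory."""
        return "read hit" if addr in self.lines else "read miss"


llc = ToyLLCSet()
llc.cpu_fill(0x100, way=3)     # line already cached by a core, outside the DDIO ways
print(llc.ddio_write(0x100))   # write update
print(llc.ddio_write(0x200))   # write allocate (way 9)
print(llc.ddio_write(0x300))   # write allocate (way 10)
print(llc.ddio_write(0x400))   # write allocate (way 9 again, evicting 0x200)
print(llc.nic_read(0x200))     # read miss: evicted before it was read
print(llc.nic_read(0x100))     # read hit
```

The last two reads show why the limited allocation region matters: lines that DDIO allocates compete only for the DDIO ways, so a burst of new lines can evict earlier ones before they are consumed, while a line already cached elsewhere is updated in place.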

  15. How does DDIO work? Designed a set of micro-benchmarks to learn about DDIO: (a) Which ways are used for allocation? (b) How does DDIO interact with other applications? (c) Does DMA via a remote CPU socket pollute the LLC? [Diagram: CPU socket with cores (C) sharing a logical LLC; packets are sent/received via DDIO.]

  16. LLC ways used by DDIO. Use CAT* to limit code/data. [Diagram: an I/O application on core C0 sending/receiving packets via DDIO; logical LLC with ways 1 to 11.] (* Cache Allocation Technology.)

  17. LLC ways used by DDIO. Use CAT* to limit code/data. [Diagram: an I/O application on core C0 sending/receiving packets via DDIO and a cache-sensitive application+ on core C1; logical LLC with ways 1 to 11.] (* Cache Allocation Technology. + water_nsquared from the Splash-3 benchmark.)

  18. LLC ways used by DDIO. Use CAT* to limit code/data. [Diagram: an I/O application on core C0 sending/receiving packets via DDIO and a cache-sensitive application+ on core C1; logical LLC with ways 1 to 11.] [Figure: sum of cache misses (million, y-axis 0 to 10) vs. ways allocated by CAT to the cache-sensitive application (1,2; 2,3; 3,4; 4,5; 5,6; 6,7; 7,8; 8,9; 9,10; 10,11).] (* Cache Allocation Technology. + water_nsquared from the Splash-3 benchmark.)

  22. LLC ways used by DDIO. Contention with code/data causes a rise in the cache misses of the I/O application. Use CAT* to limit code/data. [Diagram: an I/O application on core C0 sending/receiving packets via DDIO and a cache-sensitive application+ on core C1; logical LLC with ways 1 to 11.] [Figure: sum of cache misses (million, y-axis 0 to 10) vs. ways allocated by CAT to the cache-sensitive application (1,2; 2,3; 3,4; 4,5; 5,6; 6,7; 7,8; 8,9; 9,10; 10,11).] (* Cache Allocation Technology. + water_nsquared from the Splash-3 benchmark.)
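The x-axis of the experiment slides a two-way window across the 11 LLC ways. CAT expresses such an allocation as a contiguous capacity bitmask, so the ten configurations can be enumerated as below. This is a minimal sketch: the helper name is hypothetical, and mapping the slide's way number 1 to bit 0 of the mask is an assumption, since the slides do not state which end of the mask corresponds to which way.

```python
NUM_WAYS = 11  # ways 1 to 11, as numbered on the slides


def pair_mask(first_way: int) -> int:
    """Capacity bitmask covering ways first_way and first_way + 1 (1-based)."""
    if not 1 <= first_way < NUM_WAYS:
        raise ValueError(f"first_way must be in 1..{NUM_WAYS - 1}")
    # Assumption: way 1 maps to bit 0, way 11 to bit 10.
    return 0b11 << (first_way - 1)


masks = {f"{w},{w + 1}": pair_mask(w) for w in range(1, NUM_WAYS)}
for label, mask in masks.items():
    print(f"ways {label:>5}: 0x{mask:03x}")
```

Each mask would be applied to the class of service of the cache-sensitive application's core before a run, and the cache misses of the I/O application would be recorded per configuration; the configurations where its misses rise indicate contention with the ways that DDIO allocates into.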

