Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks - PowerPoint PPT Presentation



SLIDE 1

Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks

Alireza Farshin*, Amir Roozbeh*+, Gerald Q. Maguire Jr.*, Dejan Kostić

*KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science (EECS) + Ericsson Research

SLIDE 2

Traditional I/O

2020-07-02

I/O Device

  • 1. I/O device DMAs* packets to main memory
  • 2. CPU later fetches them to cache

* Direct Memory Access (DMA)
SLIDE 3

Traditional I/O

I/O Device

  • 1. I/O device DMAs* packets to main memory
  • 2. CPU later fetches them to cache

Inefficient:

  • Large number of accesses to main memory
  • High access latency (>60 ns)
  • Unnecessary memory bandwidth usage

* Direct Memory Access (DMA)

SLIDE 4

Direct Cache Access (DCA)

I/O Device

  • 1. I/O device DMAs packets to main memory
  • 2. DCA exploits TPH* to prefetch a portion of packets into cache
  • 3. CPU later fetches them from cache

* PCIe TLP Processing Hints (TPH)

SLIDE 5

Direct Cache Access (DCA)

I/O Device

  • 1. I/O device DMAs packets to main memory
  • 2. DCA exploits TPH* to prefetch a portion of packets into cache
  • 3. CPU later fetches them from cache

  • Still inefficient in terms of memory bandwidth usage
  • Requires OS intervention and support from the processor

* PCIe TLP Processing Hints (TPH)

SLIDE 6

Intel Data Direct I/O (DDIO)

I/O Device

  • DDIO in Xeon processors since Xeon E5
  • DMA packets or descriptors directly to/from the Last Level Cache (LLC)

SLIDE 7

Trends

More in-network computing + offloading capabilities: push costly calculations into the network and perform stateful functions at the processor, which makes applications more I/O intensive.

SLIDE 8

Pressure from these trends

More in-network computing + offloading capabilities. Faster link speeds.

Multi-hundred-gigabit networks cannot tolerate memory access latency, and the interarrival time of packets continues to shrink.

Every 6.72 ns a new (64-B + 20-B*) packet arrives at 100 Gbps.

* 7-B preamble + 1-B start-of-frame delimiter + 12-B inter-frame gap = 20 B
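The 6.72 ns figure follows directly from the minimum frame size plus the Ethernet framing overhead listed in the footnote; a quick sanity check (Python sketch, assuming only the stated 64 B + 20 B on the wire):

```python
# Interarrival time of minimum-size Ethernet frames at 100 Gbps.
frame = 64              # minimum Ethernet frame, bytes
overhead = 7 + 1 + 12   # preamble + SFD + inter-frame gap, bytes
link_bps = 100e9        # 100 Gbps line rate

wire_bits = (frame + overhead) * 8           # 672 bits on the wire
interarrival_ns = wire_bits / link_bps * 1e9
print(f"{interarrival_ns:.2f} ns")           # → 6.72 ns
```

At 200 Gbps the budget halves to 3.36 ns, far below a single ~60 ns main-memory access.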

SLIDE 9

DCA matters because

Without DCA we are unable to process I/O at line rate, thus increasing packet loss or latency when utilizing multi-hundred-gigabit networks.

SLIDE 10

Forwarding Packets at 100 Gbps

Packet Generator → Device under Test (forwarding packets) at 100 Gbps

Intel Xeon Gold 6140, Mellanox ConnectX-5

Each NIC is placed in a PCIe 3.0 x16 slot*

[Figure: 99th percentile latency (µs) vs. rate (100 Gbps, 200 Gbps)]

* A PCIe 3.0 x16 slot is capable of providing ~125 Gbps effective full-duplex bandwidth.
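The ~125 Gbps footnote can be checked from PCIe 3.0 parameters (a back-of-envelope sketch; it ignores TLP/DLLP protocol overhead, which pushes the usable figure slightly below the raw line rate):

```python
# Approximate effective bandwidth of a PCIe 3.0 x16 slot.
lanes = 16
raw_gts = 8e9          # PCIe 3.0 signaling rate per lane, transfers/s
encoding = 128 / 130   # 128b/130b line coding efficiency

effective_gbps = lanes * raw_gts * encoding / 1e9
print(f"{effective_gbps:.1f} Gbps")   # ≈ 126 Gbps before packet overhead
```

So a single x16 slot comfortably carries one 100 Gbps NIC, but cannot carry two; hence the two-slot setup on the next slide.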

SLIDE 11

What happens at 200 Gbps?

Packet Generator → Device under Test (forwarding packets) at 2x100 Gbps

Intel Xeon Gold 6140, Mellanox ConnectX-5

Each NIC is placed in a PCIe 3.0 x16 slot*

[Figure: 99th percentile latency (µs) of the first NIC, when forwarding at the indicated aggregate rate (100 Gbps vs. 200 Gbps)]

When forwarding at 200 Gbps, 30% higher latency for the NIC forwarding at 100 Gbps

* A PCIe 3.0 x16 slot is capable of providing ~125 Gbps effective full-duplex bandwidth.

SLIDE 12

How does DDIO work?

Writing packets/descriptors:

[Figure: CPU socket with cores sharing a logical LLC; sending/receiving packets via DDIO; write to the same cache line already present in LLC]

DDIO overwrites a cache line if it is already present in any LLC way (≡ write update or hit)

SLIDE 13

How does DDIO work?

Writing packets/descriptors:

[Figure: CPU socket with cores sharing a logical LLC; sending/receiving packets via DDIO; line not present in LLC, so a cache line is allocated]

DDIO overwrites a cache line if it is already present in any LLC way (≡ write update or hit)

Otherwise, DDIO allocates a cache line in a limited portion of the LLC (≡ write allocate or miss)

SLIDE 14

How does DDIO work?

Writing packets/descriptors:

[Figure: CPU socket with cores sharing a logical LLC; sending/receiving packets via DDIO]

DDIO overwrites a cache line if it is already present in any LLC way (≡ write update or hit)

Otherwise, DDIO allocates a cache line in a limited portion of the LLC (≡ write allocate or miss)

Reading packets/descriptors:

NIC reads a cache line if it is already present in any LLC way (≡ read hit). Otherwise, NIC reads it from main memory (≡ read miss).
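The write rules above can be captured in a toy model (a simplified Python sketch, not Intel's implementation: one cache set, FIFO replacement within the DDIO portion, 2 of 11 ways as on the CPUs discussed later):

```python
from collections import deque

N_WAYS, DDIO_WAYS = 11, 2   # whole set vs. the limited DDIO portion

class LLCSet:
    def __init__(self):
        self.resident = set()     # lines present anywhere in the set
        self.ddio_fifo = deque()  # lines DDIO allocated (at most DDIO_WAYS)

    def ddio_write(self, line):
        if line in self.resident:                # write update (hit):
            return "hit"                         # overwrite in place, no eviction
        if len(self.ddio_fifo) == DDIO_WAYS:     # DDIO portion full:
            evicted = self.ddio_fifo.popleft()   # evict an earlier I/O line
            self.resident.discard(evicted)
        self.ddio_fifo.append(line)              # write allocate (miss)
        self.resident.add(line)
        return "miss"

s = LLCSet()
print(s.ddio_write("pkt0"))  # miss (allocate)
print(s.ddio_write("pkt1"))  # miss
print(s.ddio_write("pkt0"))  # hit (update in place)
print(s.ddio_write("pkt2"))  # miss: evicts pkt0, possibly not yet processed
```

The last line previews the problem on the later slides: once I/O traffic exceeds the DDIO portion, write allocates start evicting packets that the CPU has not consumed yet.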

SLIDE 15

How does DDIO work?

[Figure: CPU socket with cores sharing a logical LLC; sending/receiving packets via DDIO]

Designed a set of micro-benchmarks to learn about DDIO:

  • Which ways are used for allocation?
  • How does DDIO interact with other applications?
  • Does DMA via a remote CPU socket pollute the LLC?

SLIDE 16

LLC ways used by DDIO

[Figure: logical LLC ways 1-11; core C0 runs the I/O application; use CAT* to limit its code/data]

Sending/Receiving Packets via DDIO

* Cache Allocation Technology

SLIDE 17

LLC ways used by DDIO

[Figure: logical LLC ways 1-11; C0 runs the I/O application, C1 the cache-sensitive application+; use CAT* to limit code/data]

Sending/Receiving Packets via DDIO

* Cache Allocation Technology
+ water_nsquared from the Splash-3 benchmark

SLIDE 18

LLC ways used by DDIO

[Figure: logical LLC ways 1-11; C0 runs the I/O application, C1 the cache-sensitive application+. Plot: sum of cache misses (million) vs. ways allocated by CAT to the cache-sensitive application (1,2 ... 10,11)]

Sending/Receiving Packets via DDIO

* Cache Allocation Technology
+ water_nsquared from the Splash-3 benchmark

SLIDE 19-21: (animation frames repeating SLIDE 18, sliding the cache-sensitive application's two CAT ways across the LLC)

SLIDE 22

LLC ways used by DDIO

[Figure: as before; plot of sum of cache misses (million) vs. ways allocated by CAT to the cache-sensitive application]

Sending/Receiving Packets via DDIO

Contention with code/data causes a rise in the cache misses of the I/O application

* Cache Allocation Technology
+ water_nsquared from the Splash-3 benchmark

SLIDE 23-26: (animation frames repeating SLIDE 18, continuing to slide the CAT ways toward the top of the LLC)

SLIDE 27

LLC ways used by DDIO

[Figure: as before; plot of sum of cache misses (million) vs. ways allocated by CAT to the cache-sensitive application]

Sending/Receiving Packets via DDIO

Contention with I/O causes a rise in the cache misses of the I/O application

* Cache Allocation Technology
+ water_nsquared from the Splash-3 benchmark

SLIDE 28

LLC ways used by DDIO

[Figure: as before, with the DDIO ways highlighted; plot of sum of cache misses (million) vs. ways allocated by CAT to the cache-sensitive application]

Sending/Receiving Packets via DDIO

Contention with I/O causes a rise in the cache misses of the I/O application

See our paper

* Cache Allocation Technology
+ water_nsquared from the Splash-3 benchmark
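One way to see why the topmost ways are the contended ones: sliding the cache-sensitive application's two CAT ways across the LLC eventually overlaps the ways DDIO allocates into (a bitmask sketch; it assumes the 11-way LLC and the 0x600 default mask discussed on a later slide):

```python
DDIO_MASK = 0x600   # default IIO LLC WAYS value: the top two of 11 ways

def pair_mask(first_bit):
    """CAT bitmask selecting two consecutive LLC ways."""
    return 0b11 << first_bit

for first in range(10):   # slide the pair across ways 1,2 ... 10,11
    overlap = pair_mask(first) & DDIO_MASK
    print(f"ways {first + 1},{first + 2}: "
          f"{'contends with DDIO' if overlap else 'no DDIO overlap'}")
```

Only the last two positions (ways 9,10 and 10,11) intersect the DDIO mask, matching the rise in I/O cache misses seen at the right edge of the plot.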

SLIDE 29

How does DDIO perform?

DDIO cannot provide the expected benefits!

  • ResQ* [NSDI'18]
  • Intel reports

Write-allocate DDIO could evict not-yet-processed and already-processed packets from the LLC; such packets must then be read from main memory rather than the LLC.

Reduce the number of RX descriptors so that the buffers fit in the limited DDIO portion.

* ResQ: Enabling SLOs in Network Function Virtualization

SLIDE 30

Reducing #Descriptors is Not Sufficient! (1/2)

Increasing the number of RX descriptors and packet size adversely affects the performance of DDIO.

DDIO cannot use the whole reserved capacity in the LLC: 375 KB ≪ 4.5 MB*

* DDIO uses 2 ways out of 11 ways, i.e., 24.75 MB x 2 / 11 = 4.5 MB

SLIDE 31

Reducing #Descriptors is Not Sufficient! (2/2)

DDIO should be able to perform well with a high number of RX descriptors!

[Figure: DDIO write-hit rate (%) vs. number of cores (2-18)]

Increasing the number of cores does not always improve PCIe metrics for an I/O intensive application.

Forwarding 1500-B packets at 100 Gbps with 256 per-core RX descriptors: 1500 B x 256 x 18 ≈ 6.59 MB >> 4.5 MB
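The two capacity figures above come straight from the slide's arithmetic; a sketch reproducing them (assuming the Xeon Gold 6140's 24.75 MB LLC and DDIO's default 2-of-11 ways):

```python
# DDIO-reachable LLC capacity vs. the RX buffer working set.
llc_mb = 24.75                    # Xeon Gold 6140 LLC size
ddio_mb = llc_mb * 2 / 11         # 2 of 11 ways → 4.5 MB for DDIO allocations

pkt_bytes, descs_per_core, cores = 1500, 256, 18
working_set_mib = pkt_bytes * descs_per_core * cores / 2**20

print(f"DDIO portion: {ddio_mb:.1f} MB, RX buffers: {working_set_mib:.2f} MiB")
# With 18 cores the buffers outgrow the DDIO portion even at 256 descriptors.
```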

SLIDE 32

Tuning a little-discussed register can improve the performance of DDIO

[Figure: logical LLC with the IIO LLC WAYS bitmask 1 1 0 0 0 0 0 0 0 0 0 mapped onto the ways]

IIO LLC WAYS Register

Default value is 0x600

Increasing the number of bits set improves DDIO hit rates.
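The register is a bitmask over LLC ways, so each additional set bit grows the portion of the LLC that DDIO write-allocates may use. A sketch of the resulting capacities (only the 0x600 default is stated above; the 4-, 6-, and 8-bit mask values here are illustrative choices that extend the default contiguously downward):

```python
# Capacity reachable by DDIO for a given IIO LLC WAYS bitmask,
# on a 24.75 MB, 11-way LLC.
LLC_MB, N_WAYS = 24.75, 11

def ddio_capacity_mb(mask):
    ways = bin(mask).count("1")
    return LLC_MB * ways / N_WAYS

for bits, mask in [(2, 0x600), (4, 0x780), (6, 0x7E0), (8, 0x7F8)]:
    print(f"{bits} bits set (mask {mask:#x}): {ddio_capacity_mb(mask):.1f} MB")
```

With 8 bits set, DDIO can reach 18 MB of LLC, which comfortably holds the 6.59 MiB RX working set from the previous slide.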

SLIDE 33

Impact of Tuning DDIO

DDIO's effect on hit rates can affect application-level performance, depending on an application's characteristics.

[Figure: 99th percentile latency (µs) vs. number of RX descriptors (512-4096), for 2, 4, 6, and 8 bits set in IIO LLC WAYS]

For example, an I/O intensive application: 2 cores forwarding 1500-B packets at 100 Gbps

SLIDE 34

Impact of Tuning DDIO

DDIO's effect on hit rates can affect application-level performance, depending on an application's characteristics.

[Figure: 99th percentile latency (µs) vs. number of RX descriptors (512-4096), for 2, 4, 6, and 8 bits set in IIO LLC WAYS]

For example, an I/O intensive application: 2 cores forwarding 1500-B packets at 100 Gbps

Setting more bits reduces tail latency (by up to 30%)

SLIDE 35

Is Tuning DDIO Enough?

Tuning is not a perfect solution, due to:

  • Cache being used for code/data,
  • Smaller per-core cache quota, and
  • Coarse-grained partitions.

Next-generation DCA should provide:

  • Fine-grained placement: similar to CacheDirector* [EuroSys'19]
  • I/O isolation: extend CAT+ and CDP++ to include I/O
  • Selective DCA/DMA: only transfer relevant parts of the packet to the LLC

* Make the Most out of Last Level Cache in Intel Processors
+ Cache Allocation Technology
++ Code/Data Prioritization

SLIDE 36

What about Current Systems?

DMA should not be directed to the cache if this would cause I/O evictions!

  • Disabling DDIO for a specific PCIe port
  • Exploiting a remote socket

Bypassing the cache is beneficial in a multi-tenant/multi-application environment, where some performance isolation is desired.

SLIDE 37

Using Our Knowledge for 200 Gbps

Device under Test: forwarding packets at 2x100 Gbps, with the NICs on sockets connected via UPI

[Figure: 99th percentile latency (µs) of the first NIC versus aggregate rate: 100 Gbps; 200 Gbps; 200 Gbps (4 bits); 200 Gbps (remote socket); 200 Gbps (disable)]

Tuning DDIO improves packet processing at 200 Gbps

Better cache management is necessary for multi-hundred-gigabit-per-second networks

SLIDE 38

Other Insights

We study the performance of DDIO in different scenarios. See our paper for more results about:

  • How does receiving rate affect DDIO performance?
  • How does processing time affect DDIO performance?
  • Is DDIO always beneficial?
  • Scaling up and DDIO.

SLIDE 39

Our Key Findings (1/2)

  • If an application is I/O bound, adding excessive cores could degrade its performance.
  • If an application is I/O bound, tuning a little-discussed register called IIO LLC WAYS could improve performance and lead to the same improvements as adding more cores.
  • If an application starts to become CPU bound, adding more cores could improve its throughput, but it is important to balance load among cores to maximize DDIO's benefits.
  • Getting close to ~100 Gbps can cause DDIO to become a bottleneck. Therefore, it is essential to know when to bypass the cache to realize performance isolation.

SLIDE 40

Our Key Findings (2/2)

  • If an application is truly CPU/memory bound, tuning DDIO is less efficient.

We now explain the impact of processing time on the performance of DDIO, which resulted in this finding.
SLIDE 41

Impact of Processing Time

Device under Test: input packet → swap MAC addresses, then call a random number generator (std::mt19937) a variable number of times → output packet

[Figure: DDIO write/read hit rates (%) and throughput vs. number of RNG calls (10-100)]

Increasing processing time improves DDIO performance
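The per-packet work in this experiment can be sketched as follows (a hypothetical Python mock-up for illustration; the actual device under test is a C++/DPDK forwarder using std::mt19937):

```python
import random

def process(pkt: bytearray, rng_calls: int) -> bytearray:
    """Swap dst/src MAC addresses, then burn CPU time with RNG draws."""
    pkt[0:6], pkt[6:12] = pkt[6:12], pkt[0:6]  # first 12 B of an Ethernet header
    rng = random.Random(42)                    # stand-in for std::mt19937
    for _ in range(rng_calls):                 # knob controlling processing time
        rng.random()
    return pkt

frame = bytearray(range(14)) + bytearray(50)   # toy Ethernet frame
out = process(frame, rng_calls=50)
print(out[:6].hex(), out[6:12].hex())          # MAC fields are now swapped
```

Raising `rng_calls` lengthens the time between packet arrivals at the LLC, giving DDIO's limited portion more room to absorb I/O writes, which is why hit rates improve with processing time.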

SLIDE 42

Impact of Processing Time

Device under Test: input packet → swap MAC addresses, then call a random number generator (std::mt19937) a variable number of times → output packet

[Figure: DDIO write/read hit rates (%) and throughput (Gbps) vs. number of RNG calls (10-100)]

Increasing processing time reduces throughput

DDIO performance matters most when an application is I/O bound, rather than CPU/memory bound.

slide-43
SLIDE 43
  • DCA/DDIO should be tuned for I/O intensive

applications.

  • DCA/DDIO needs to be rearchitected for

multi-hundred-gigabit networks.

  • Benchmark your testbed with our source

code.

Conclusion

2020-07-02 43

https://github.com/aliireza/ddio-bench

This work is supported by ERC, SSF , and WASP .

SLIDE 44

Thanks for listening

Do not hesitate to contact us if you have any questions: farshin@kth.se and amirrsk@kth.se